Preface


The best algorithm designers prove both possibility and impossibility results: both upper
and lower bounds. For example, every serious computer scientist knows a collection of
canonical NP-complete problems and how to reduce them to other problems of interest.
Communication complexity offers a clean theory that is extremely useful for proving lower
bounds for lots of different fundamental problems. Many of the most significant algorithmic
consequences of the theory follow from its most elementary aspects. 

This document collects the lecture notes from my course “Communication Complexity 
(for Algorithm Designers),” taught at Stanford in the winter quarter of 2015. The two 
primary goals of the course are: 

(1) Learn several canonical problems in communication complexity that are useful for
proving lower bounds for algorithms (Disjointness, Index, Gap-Hamming, etc.).

(2) Learn how to reduce lower bounds for fundamental algorithmic problems to
communication complexity lower bounds.

Along the way, we’ll also: 

(3) Get exposure to lots of cool computational models and some famous results about 
them — data streams and linear sketches, compressive sensing, space-query time 
trade-offs in data structures, sublinear-time algorithms, and the extension complexity 
of linear programs. 

(4) Scratch the surface of techniques for proving communication complexity lower bounds 
(fooling sets, corruption arguments, etc.). 

Readers are assumed to be familiar with undergraduate-level algorithms, as well as the 
statements of standard large deviation inequalities (Markov, Chebyshev, and Chernoff- 
Hoeffding). 

The course begins in Lectures 1-3 with the simple case of one-way communication
protocols, where only a single message is sent, and their relevance to algorithm
design. Each of these lectures depends on the previous one. Many of the “greatest hits” of
communication complexity applications, including lower bounds for small-space streaming
algorithms and compressive sensing, are already implied by lower bounds for one-way
protocols. Reasoning about one-way protocols also provides a gentle warm-up to the
standard model of general two-party communication protocols, which is the subject of
Lecture 4. Lectures 5-8 translate communication complexity lower bounds into lower bounds
in several disparate problem domains: the extension complexity of polytopes, data structure
design, algorithmic game theory, and property testing. Each of these final four lectures
depends only on Lecture 4.

The course Web page (http://theory.stanford.edu/~tim/w15/w15.html) contains
links to relevant large deviation inequalities, links to many of the papers cited in these notes, 
and a partial list of exercises. Lecture notes and videos on several other topics in theoretical 
computer science are available from my Stanford home page. 

I am grateful to the Stanford students who took the course, for their many excellent 
questions: Josh Alman, Dylan Cable, Brynmor Chapman, Michael Kim, Arjun Puranik,
Okke Schrijvers, Nolan Skochdopole, Dan Stubbs, Joshua Wang, Huacheng Yu, Lin Zhai, and 
several auditors whose names I’ve forgotten. I am also indebted to Alex Andoni, Parikshit 
Gopalan, Ankur Moitra, and C. Seshadhri for their advice on some of these lectures. The 
writing of these notes was supported in part by NSF award CCF-1215965. 

I always appreciate suggestions and corrections from readers. 


Tim Roughgarden 

474 Gates Building, 353 Serra Mall 

Stanford, CA 94305 

Email: tim@cs.stanford.edu 

WWW: http://theory.stanford.edu/~tim/

Lecture 1


Data Streams: Algorithms and Lower Bounds 


1.1 Preamble 


This class is mostly about impossibility results: lower bounds on what can be accomplished
by algorithms. However, our perspective will be unapologetically that of an algorithm
designer.^1 We’ll learn lower bound technology on a “need-to-know” basis, guided by fundamental
algorithmic problems that we care about (perhaps theoretically, perhaps practically).
That said, we will wind up learning quite a bit of complexity theory (specifically, communication
complexity) along the way. We hope this viewpoint makes this course and these



communication complexity lower bounds also provides a convenient excuse to take a guided 
tour of numerous models, problems, and algorithms that are central to modern research in 
the theory of algorithms but missing from many algorithms textbooks: streaming algorithms, 
space-time trade-offs in data structures, compressive sensing, sublinear algorithms, extended 
formulations for linear programs, and more. 

Why should an algorithm designer care about lower bounds? The best mathematical 
researchers can work on an open problem simultaneously from both sides. Even if you have 
a strong prior belief about whether a given mathematical statement is true or false, failing 
to prove one direction usefully informs your efforts to prove the other. (And for most of us, the
prior belief is wrong surprisingly often!) In algorithm design, working on both sides means
striving simultaneously for better algorithms and for better lower bounds. For example,
a good undergraduate algorithms course teaches you both how to design polynomial-time
algorithms and how to prove that a problem is NP-complete, since when you encounter a
new computational problem in your research or workplace, both are distinct possibilities.
There are many other algorithmic problems where the fundamental difficulty is not the 
amount of time required, but rather concerns communication or information transmission. 
The goal of this course is to equip you with the basic tools of communication complexity — 
its canonical hard problems, the canonical reductions from computation in various models to 


^1 Already in this lecture, over half our discussion will be about algorithms and upper bounds!

^2 See Pătraşcu (2009) for a series of four blog posts on data structures that share some spirit with our
approach.


the design of low-communication protocols, and a little bit about its lower bound techniques 
— in the service of becoming a better algorithm designer. 


This lecture and the next study the data stream model of computation. There are some 
nice upper bounds in this model (see Sections 1.4 and 1.5), and the model also naturally
motivates a severe but useful restriction of the general communication complexity setup
(Section 1.7). We’ll cover many computational models in the course, so whenever you get
sick of one, don’t worry, a new one is coming up around the corner. 


1.2 The Data Stream Model 

The data stream model is motivated by applications in which the input is best thought
of as a firehose: packets arriving to a network switch at a torrential rate, or data being
generated by a telescope at a rate of one exabyte per day. In these applications, there’s no
hope of storing all the data, but we’d still like to remember useful summary statistics about 
what we’ve seen. 

Alternatively, for example in database applications, it could be that data is not thrown 
away but resides on a big, slow disk. Rather than incurring random access costs to the 
data, one would like to process it sequentially once (or a few times), perhaps overnight, 
remembering the salient features of the data in a limited main memory for real-time use. 
The daily transactions of Amazon or Walmart, for example, could fall into this category. 

Formally, suppose data elements belong to a known universe $U = \{1, 2, \ldots, n\}$. The
input is a stream $x_1, x_2, \ldots, x_m \in U$ of elements that arrive one-by-one. Our algorithms will
not assume advance knowledge of $m$, while our lower bounds will hold even if $m$ is known a
priori. With space $\approx m \log_2 n$, it is possible to store all of the data. The central question
in data stream algorithms is: what is possible, and what is impossible, using a one-pass
algorithm and much less than $m \log n$ space? Ideally, the space usage should be sublinear or
even logarithmic in $n$ and $m$. We’re not going to worry about the computation time used by
the algorithm (though our positive results in this model have low computational complexity,
anyway).

Many of you will be familiar with a streaming or one-pass algorithm from the following
common interview question. Suppose an array $A$, with length $m$, is promised to have a
majority element, meaning an element that occurs strictly more than $m/2$ times. A simple one-pass
algorithm, which maintains only the current candidate majority element and a counter for it
(so $O(\log n + \log m)$ bits), solves this problem. (If you haven’t seen this algorithm before,
see the Exercises.) This can be viewed as an exemplary small-space streaming algorithm.^3
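For readers who want to see such an algorithm concretely, here is a minimal sketch of one standard procedure fitting this description (usually attributed to Boyer and Moore). The notes defer the algorithm to the Exercises, so treating this as the intended one is an assumption, and the function name `majority_candidate` is hypothetical.

```python
def majority_candidate(stream):
    """One pass, one stored candidate, one counter.

    Correct only under the promise that a majority element exists.
    """
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            # No surviving candidate: adopt the current element.
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            # A non-matching element cancels one vote for the candidate.
            count -= 1
    return candidate
```

Each cancellation pairs one occurrence of the candidate with one occurrence of a different element, so a true majority element cannot be fully cancelled. Without the promise, the returned value carries no guarantee.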


^3 Interestingly, the promise that a majority element exists is crucial. A consequence of the next lecture is
that there is no small-space streaming algorithm to check whether or not a majority element exists!






1.3 Frequency Moments 


Next we introduce the canonical problems in the field of data stream algorithms: computing
the frequency moments of a stream. These were studied in the paper that kickstarted the
field (Alon et al., 1999), and the data stream algorithms community has been obsessed with
them ever since.

Fix a data stream $x_1, x_2, \ldots, x_m \in U$. For an element $j \in U$, let $f_j \in \{0, 1, 2, \ldots, m\}$
denote the number of times that $j$ occurs in the stream. For a non-negative integer $k$, the
$k$th frequency moment of the stream is

\[ F_k = \sum_{j \in U} f_j^k. \tag{1.1} \]

Note that the bigger $k$ is, the more the sum in (1.1) is dominated by the largest frequencies.
It is therefore natural to define

\[ F_\infty = \max_{j \in U} f_j \]

as the largest frequency of any element of $U$.

Let’s get some sense of these frequency moments. $F_1$ is boring: since each data
stream element contributes to exactly one frequency $f_j$, $F_1 = \sum_{j \in U} f_j$ is simply the stream
length $m$. $F_0$ is the number of distinct elements in the stream (we’re interpreting $0^0 = 0$);
it’s easy to imagine wanting to compute this quantity, for example a network switch might
want to know how many different TCP flows are currently going through it. $F_\infty$ is the
largest frequency, and again it’s easy to imagine wanting to know this, for example to
detect a denial-of-service attack at a network switch, or identify the most popular product
on Amazon yesterday. Note that computing $F_\infty$ is related to the aforementioned problem of
detecting a majority element. Finally, $F_2 = \sum_{j \in U} f_j^2$ is sometimes called the “skew” of the
data; it is a measure of how non-uniform the data stream is. In a database context, it
arises naturally as the size of a “self-join”: the table you get when you join a relation with
itself on a particular attribute, with the $f_j$’s being the frequencies of various values of this
attribute. Having estimates of self-join (and more generally join) sizes at the ready is useful
for query optimization, for example. We’ll talk about $F_2$ extensively in the next section.^4

It is trivial to compute all of the frequency moments in $O(m \log n)$ space, just by storing
the $x_i$’s, or in $O(n \log m)$ space, just by computing and storing the $f_j$’s (a $\log m$-bit counter
for each of the $n$ universe elements). Similarly, $F_1$ is trivial to compute in $O(\log m)$ space
(via a counter), and $F_0$ in $O(n)$ space (via a bit vector). Can we do better?
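As a baseline for comparison, the trivial exact solution that stores every frequency might be sketched as follows. The function names are hypothetical, and the dictionary of counters is exactly the large-space approach that the rest of the lecture tries to beat.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k, storing one counter per distinct element seen."""
    freqs = Counter(stream)
    if k == 0:
        # Convention 0^0 = 0: F_0 counts the distinct elements.
        return len(freqs)
    return sum(f ** k for f in freqs.values())

def max_frequency(stream):
    """Exact F_infinity: the largest frequency of any element."""
    return max(Counter(stream).values())
```

On the stream 1, 2, 1, 3, 1, 2 the frequencies are 3, 2, 1, so the moments are easy to check by hand.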

Intuitively, it might appear difficult to improve over the trivial solution. For $F_0$, for
example, it seems like you have to know which elements you’ve already seen (to avoid
double-counting them), and there’s an exponential (in $n$) number of different possibilities for

^4 The problem of computing $F_2$ and the solution we give for it are also quite well connected to other
important concepts, such as compressive sensing and dimensionality reduction.


what you might have seen in the past. As we’ll see, this is good intuition for deterministic 
algorithms, and for (possibly randomized) exact algorithms. Thus, the following positive 
result is arguably surprising, and very cool.^5


Theorem 1.1 (Alon et al., 1999) Both $F_0$ and $F_2$ can be approximated, to within a $(1 \pm \epsilon)$
factor with probability at least $1 - \delta$, in space

\[ O\left(\epsilon^{-2} (\log n + \log m) \log \tfrac{1}{\delta}\right). \tag{1.2} \]


Theorem 1.1 refers to two different algorithms, one for $F_0$ and one for $F_2$. We cover the
latter in detail below. Section 1.5 describes the high-order bit of the $F_0$ algorithm, which is
a modification of the earlier algorithm of Flajolet and Martin (1983), with the details in
the exercises. Both algorithms are randomized, and are approximately correct (to within
$(1 \pm \epsilon)$) most of the time (except with probability $\delta$). Also, the $\log m$ factor in (1.2) is not
needed for the $F_0$ algorithm, as you might expect. Some further optimizations are possible;
see Section 1.4.3.

The first reason to talk about Theorem 1.1 is that it’s a great result in the field of
algorithms: if you only remember one streaming algorithm, the one below might as well
be the one.^6 You should never tire of seeing clever algorithms that radically outperform the
“obvious solution” to a well-motivated problem. And Theorem 1.1 should serve as inspiration
to any algorithm designer: even when at first blush there is no non-trivial algorithm for a
problem in sight, the right clever insight can unlock a good solution.

On the other hand, there unfortunately are some important problems out there with no
non-trivial solutions. And it’s important for the algorithm designer to know which ones they
are: the less effort wasted on trying to find something that doesn’t exist, the more energy
is available for formulating the motivating problem in a more tractable way, weakening the
desired guarantee, restricting the problem instances, and otherwise finding new ways to
make algorithmic progress. A second interpretation of Theorem 1.1 is that it illuminates why
such lower bounds can be so hard to prove. A lower bound is responsible for showing that
every algorithm, even fiendishly clever ones like those employed for Theorem 1.1, cannot
make significant inroads on the problem.


1.4 Estimating $F_2$: The Key Ideas

In this section we give a nearly complete proof of Theorem 1.1 for the case of $F_2 = \sum_{j \in U} f_j^2$
estimation (a few details are left to the Exercises).


^5 The Alon-Matias-Szegedy paper (Alon et al., 1999) ignited the field of streaming algorithms as a hot
area, and for this reason won the 2005 Gödel Prize (a “test of time”-type award in theoretical computer
science). The paper includes a number of other upper and lower bounds as well, some of which we’ll cover
shortly.

^6 Either algorithm, for estimating $F_0$ or for $F_2$, could serve this purpose. We present the $F_2$ estimation
algorithm in detail, because the analysis is slightly slicker and more canonical.
1.4.1 The Basic Estimator

The high-level idea is very natural, especially once you start thinking about randomized 
algorithms. 


1. Define a randomized unbiased estimator of $F_2$, which can be computed in one pass.
Small space seems to force a streaming algorithm to lose information, but maybe it’s 
possible to produce a result that’s correct “on average.” 

2. Aggregate many independent copies of the estimator, computed in parallel, to get an 
estimate that is very accurate with high probability. 


This is very hand-wavy, of course — does it have any hope of working? It’s hard to answer 
that question without actually doing some proper computations, so let’s proceed to the 
estimator devised in Alon et al. (1999). 

The Basic Estimator:^7

1. Let $h : U \to \{\pm 1\}$ be a function that associates each universe element with a random
sign. On a first reading, to focus on the main ideas, you should assume that $h$ is a
totally random function. Later we’ll see that relatively lightweight hash functions are
good enough (Section 1.4.2), which enables a small-space implementation of the basic
ideas.

2. Initialize $Z = 0$.

3. Every time a data stream element $j \in U$ appears, add $h(j)$ to $Z$. That is, increment
$Z$ if $h(j) = +1$ and decrement $Z$ if $h(j) = -1$.^8

4. Return the estimate $X = Z^2$.


Remark 1.2 A crucial point: since the function $h$ is fixed once and for all before the data
stream arrives, an element $j \in U$ is treated consistently every time it shows up. That is, $Z$
is either incremented every time $j$ shows up or is decremented every time $j$ shows up. In
the end, element $j$ contributes $h(j) f_j$ to the final value of $Z$.
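The four steps of the basic estimator can be sketched as below, under the "first reading" assumption that $h$ is a totally random function. The function name and the explicit sign table are illustrative choices, not part of the notes, and storing the whole table is of course not small-space.

```python
import random

def basic_f2_estimate(stream, universe, rng):
    """One copy of the basic estimator with a totally random sign function."""
    # Step 1: fix a random sign for every universe element, before the stream arrives.
    h = {j: rng.choice((-1, 1)) for j in universe}
    # Step 2: initialize the running sum.
    z = 0
    # Step 3: each arrival of j adds h(j); the same j always gets the same sign.
    for j in stream:
        z += h[j]
    # Step 4: the estimate is the square of the running sum.
    return z * z
```

A quick sanity check: if the stream consists of one element repeated $m$ times, then $Z = \pm m$ and the estimate equals $F_2 = m^2$ regardless of the random sign.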

First we need to prove that the basic estimator is indeed unbiased. 

Lemma 1.3 For every data stream, 


\[ \mathbf{E}_h[X] = F_2. \]

^7 This is sometimes called the “tug-of-war” estimator.

^8 This is the “tug of war,” between elements $j$ with $h(j) = +1$ and those with $h(j) = -1$.
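Proof: We have

\begin{align*}
\mathbf{E}[X] = \mathbf{E}[Z^2]
  &= \mathbf{E}\left[\Big(\sum_{j \in U} h(j) f_j\Big)^2\right] \\
  &= \mathbf{E}\left[\sum_{j \in U} h(j)^2 f_j^2 + 2 \sum_{j < \ell} h(j) h(\ell) f_j f_\ell\right] \\
  &= \underbrace{\sum_{j \in U} f_j^2}_{= F_2} + 2 \sum_{j < \ell} f_j f_\ell \underbrace{\mathbf{E}_h[h(j) h(\ell)]}_{= 0} \tag{1.3} \\
  &= F_2, \tag{1.4}
\end{align*}

where in (1.3) we use linearity of expectation and the fact that $h(j) \in \{\pm 1\}$ for every $j$, and
in (1.4) we use the fact that, for every distinct $j, \ell$, all four sign patterns for $(h(j), h(\ell))$ are
equally likely. ■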


Note the reason for both incrementing and decrementing in the running sum $Z$: it
ensures that the “cross terms” $h(j) h(\ell) f_j f_\ell$ in our basic estimator $X = Z^2$ cancel out in
expectation. Also note, for future reference, that the only time we used the assumption that
$h$ is a totally random function was in (1.4), and we only used the property that all four sign
patterns for $(h(j), h(\ell))$ are equally likely, that is, that $h$ is “pairwise independent.”

Lemma 1.3 is not enough for our goal, since the basic estimator $X$ is not guaranteed to
be close to its expectation with high probability. A natural idea is to take the average of
many independent copies of the basic estimator. That is, we’ll use $t$ independent functions
$h_1, h_2, \ldots, h_t : U \to \{\pm 1\}$ to define estimates $X_1, \ldots, X_t$. On the arrival of a new data
stream element, we update all $t$ of the counters $Z_1, \ldots, Z_t$ appropriately, with some getting
incremented and others decremented. Our final estimate will be

\[ Y = \frac{1}{t} \sum_{i=1}^{t} X_i. \]

Since the $X_i$’s are unbiased estimators, so is $Y$ (i.e., $\mathbf{E}_{h_1, \ldots, h_t}[Y] = F_2$). To see how quickly
the variance decreases with the number $t$ of copies, note that

\[ \mathrm{Var}[Y] = \mathrm{Var}\left[\frac{1}{t} \sum_{i=1}^{t} X_i\right] = \frac{1}{t^2} \sum_{i=1}^{t} \mathrm{Var}[X_i] = \frac{\mathrm{Var}[X]}{t}, \]

where $X$ denotes a single copy of the basic estimator. That is, averaging reduces the variance
by a factor equal to the number of copies. Unsurprisingly, the number of copies $t$ (and in
the end, the space) that we need to get the performance guarantee that we want is governed
by the variance of the basic estimator. So there’s really no choice but to roll up our sleeves
and compute it.

Lemma 1.4 For every data stream, 


\[ \mathrm{Var}_h[X] \le 2 F_2^2. \]

Lemma 1.4 states that the standard deviation of the basic estimator is in the same ballpark
as its expectation. That might sound ominous, but it’s actually great news: a constant
(depending on $\epsilon$ and $\delta$ only) number of copies is good enough for our purposes. Before
proving Lemma 1.4, let’s see why.

Corollary 1.5 For every data stream, with $t = \frac{2}{\epsilon^2 \delta}$,

\[ \Pr_{h_1, \ldots, h_t}\left[Y \in (1 \pm \epsilon) \cdot F_2\right] \ge 1 - \delta. \]

Proof: Recall that Chebyshev’s inequality is the inequality you want when bounding the
deviation of a random variable from its mean parameterized by the number of standard
deviations. Formally, it states that for every random variable $Y$ with finite first and second
moments, and every $c > 0$,

\[ \Pr\left[|Y - \mathbf{E}[Y]| > c\right] \le \frac{\mathrm{Var}[Y]}{c^2}. \tag{1.5} \]

Note that (1.5) is non-trivial (i.e., probability less than 1) once $c$ exceeds $Y$’s standard deviation,
and the probability goes down quadratically with the number of standard deviations.
It’s a simple inequality to prove; see the separate notes on tail inequalities for details.

We are interested in the case where $Y$ is the average of $t$ basic estimators, with variance
as in Lemma 1.4. Since we want to guarantee a $(1 \pm \epsilon)$-approximation, the deviation $c$ of
interest to us is $\epsilon F_2$. We also want the right-hand side of (1.5) to be equal to $\delta$. Using
Lemma 1.4 and solving, we get $t = 2/(\epsilon^2 \delta)$.^9 ■
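The "solving" step can be spelled out as follows. This is a worked check that uses only Chebyshev's inequality, the fact that averaging divides the variance by $t$, and the bound of Lemma 1.4; the numerical values of $\epsilon$ and $\delta$ in the comment are illustrative assumptions, not taken from the notes.

```latex
\[
\Pr\big[\,|Y - F_2| > \epsilon F_2\,\big]
  \;\le\; \frac{\mathrm{Var}[Y]}{(\epsilon F_2)^2}
  \;=\; \frac{\mathrm{Var}[X]}{t\,\epsilon^2 F_2^2}
  \;\le\; \frac{2 F_2^2}{t\,\epsilon^2 F_2^2}
  \;=\; \frac{2}{t\,\epsilon^2}.
\]
% Setting 2 / (t * eps^2) equal to delta gives t = 2 / (eps^2 * delta).
% Illustration: eps = 0.1 and delta = 0.1 give t = 2 / (0.01 * 0.1) = 2000 copies.
```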


We now stop procrastinating and prove Lemma 1.4.

Proof of Lemma 1.4: Recall that

\[ \mathrm{Var}[X] = \mathbf{E}[X^2] - \Big(\underbrace{\mathbf{E}[X]}_{= F_2 \text{ by Lemma 1.3}}\Big)^2. \tag{1.6} \]


^9 The dependence on $\frac{1}{\delta}$ can be decreased to logarithmic; see Section 1.4.3.

Zooming in on the $\mathbf{E}[X^2]$ term, recall that $X$ is already defined as the square of the running
sum $Z$, so $X^2$ is $Z^4$. Thus,

\[ \mathbf{E}[X^2] = \mathbf{E}\left[\Big(\sum_{j \in U} h(j) f_j\Big)^4\right]. \tag{1.7} \]


Expanding the right-hand side of (1.7) yields $|U|^4$ terms, each of the form
$h(j_1) h(j_2) h(j_3) h(j_4) f_{j_1} f_{j_2} f_{j_3} f_{j_4}$. (Remember: the $h$-values are random, the $f$-values are
fixed.) This might seem unwieldy. But, just as in the computation (1.4) in the proof of
Lemma 1.3, most of these are zero in expectation. For example, suppose $j_1, j_2, j_3, j_4$ are
distinct. Condition on the $h$-values of the first three. Since $h(j_4)$ is equally likely to be $+1$
or $-1$, the conditional expected value (averaged over the two cases) of the corresponding
term is 0. Since this holds for all possible values of $h(j_1), h(j_2), h(j_3)$, the unconditional
expectation of this term is also 0. This same argument applies to any term in which some
element $j \in U$ appears an odd number of times. Thus, when the dust settles, we have


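\begin{align*}
\mathbf{E}[X^2]
  &= \mathbf{E}\left[\sum_{j \in U} h(j)^4 f_j^4 + 6 \sum_{j < \ell} h(j)^2 h(\ell)^2 f_j^2 f_\ell^2\right] \\
  &= \sum_{j \in U} f_j^4 + 6 \sum_{j < \ell} f_j^2 f_\ell^2, \tag{1.8}
\end{align*}

where the “6” appears because a given $h(j)^2 h(\ell)^2 f_j^2 f_\ell^2$ term with $j < \ell$ arises in $\binom{4}{2} = 6$
different ways.

Expanding terms, we see that

\[ F_2^2 = \sum_{j \in U} f_j^4 + 2 \sum_{j < \ell} f_j^2 f_\ell^2 \]

and hence

\[ \mathbf{E}[X^2] \le 3 F_2^2. \]

Recalling (1.6) proves that $\mathrm{Var}[X] \le 2 F_2^2$, as claimed. ■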
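For a tiny universe, both lemmas can be checked by brute force: enumerate every sign function and average exactly. The sketch below does this for a given frequency vector; the function name is hypothetical, and the enumeration takes time exponential in the universe size, so it is only a sanity check and not part of the algorithm.

```python
from itertools import product

def exact_moments_of_X(freqs):
    """Return (E[X], E[X^2]) averaged over all 2^n sign functions h."""
    n = len(freqs)
    first = second = 0
    for signs in product((-1, 1), repeat=n):
        z = sum(s * f for s, f in zip(signs, freqs))
        x = z * z                 # the basic estimator X = Z^2 for this h
        first += x
        second += x * x
    return first / 2 ** n, second / 2 ** n
```

With frequencies (3, 2, 1) we have $F_2 = 14$. The enumeration gives $\mathbf{E}[X] = 14$ and $\mathbf{E}[X^2] = 98 + 6 \cdot 49 = 392$, so the variance is $392 - 196 = 196$, within the bound $2 F_2^2 = 392$.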


Looking back over the proof of Lemma 1.4, we again see that we only used the fact
that $h$ is random in a limited way. In (1.8) we used that, for every set of four distinct
universe elements, their 16 possible sign patterns (under $h$) were equally likely. (This implies
the required property that, if $j$ appears in a term an odd number of times, then even after
conditioning on the $h$-values of all other universe elements in the term, $h(j)$ is equally likely
to be $+1$ or $-1$.) That is, we only used the “4-wise independence” of the function $h$.
1.4.2 Small-Space Implementation via 4-Wise Independent Hash Functions

Let’s make sure we’re clear on the final algorithm. 

1. Choose functions $h_1, \ldots, h_t : U \to \{\pm 1\}$, where $t = \frac{2}{\epsilon^2 \delta}$.

2. Initialize $Z_i = 0$ for $i = 1, 2, \ldots, t$.

3. When a new data stream element $j \in U$ arrives, add $h_i(j)$ to $Z_i$ for every $i = 1, 2, \ldots, t$.

4. Return the average of the $Z_i^2$’s.

Last section proved that, if the $h_i$’s are chosen uniformly at random from all functions, then
the output of this algorithm lies in $(1 \pm \epsilon) F_2$ with probability at least $1 - \delta$.

How much space is required to implement this algorithm? There’s clearly a factor of $\frac{2}{\epsilon^2 \delta}$,
since we’re effectively running this many streaming algorithms in parallel, and each needs
its own scratch space. How much space does each of these need? To maintain a counter $Z_i$,
which always lies between $-m$ and $m$, we only need $O(\log m)$ bits. But it’s easy to forget
that we have to also store the function $h_i$. Recall from Remark 1.2 the reason: we need to
treat an element $j \in U$ consistently every time it shows up in the data stream. Thus, once
we choose a sign for $j$ we need to remember it forevermore. Implemented naively, with
$h_i$ a totally random function, we would need to remember one bit for each of the possibly
$\Omega(n)$ elements that we’ve seen in the past, which is a dealbreaker.

Fortunately, as we noted after the proofs of Lemmas 1.3 and 1.4, our entire analysis has
relied only on 4-wise independence: that when we look at an arbitrary 4-tuple of universe
elements, the projection of $h$ on their 16 possible sign patterns is uniform. (Exercise: go
back through this section in detail and double-check this claim.) And happily, there are 
small families of simple hash functions that possess this property. 


Fact 1.6 For every universe $U$ with $n = |U|$, there is a family $\mathcal{H}$ of 4-wise independent
hash functions (from $U$ to $\{\pm 1\}$) with size polynomial in $n$.

Fact 1.6 and our previous observations imply that, to enjoy an approximation of $(1 \pm \epsilon)$
with probability at least $1 - \delta$, our streaming algorithm can get away with choosing the
functions $h_1, \ldots, h_t$ uniformly and independently from $\mathcal{H}$.

If you’ve never seen a construction of a $k$-wise independent family of hash functions
with $k > 2$, check out the Exercises for details. The main message is to realize that you
shouldn’t be scared of them; heavy machinery is not required. For example, it suffices
to map the elements of $U$ injectively into a suitable finite field (of size roughly $|U|$), and
then let $\mathcal{H}$ be the set of all cubic polynomials (with all operations occurring in this field).
The final output is then $+1$ if the polynomial’s output (viewed as an integer) is even, and
$-1$ otherwise. Such a hash function is easily specified with $O(\log n)$ bits (just list its four
coefficients), and can also be evaluated in $O(\log n)$ space (which is not hard, but we won’t
say more about it here).
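A minimal sketch of the resulting small-space algorithm, under some loudly stated assumptions: universe elements are non-negative integers smaller than the prime `P`, the field is the integers modulo `P` (the notes only ask for "a suitable finite field"), and all names (`P`, `sample_sign_hash`, `f2_estimate`) are hypothetical. With an odd prime, the parity of a uniform field element is very slightly biased toward even (by about $1/P$), so this only approximates the exact 4-wise independent sign family; a field of size $2^b$ would avoid that bias.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed every universe element

def sample_sign_hash(rng):
    """Draw a random cubic polynomial mod P; its four coefficients describe h."""
    coeffs = [rng.randrange(P) for _ in range(4)]

    def h(j):
        # Horner's rule for coeffs[0] + coeffs[1]*j + coeffs[2]*j^2 + coeffs[3]*j^3 mod P.
        value = 0
        for c in reversed(coeffs):
            value = (value * j + c) % P
        # Map the parity of the output to a sign.
        return 1 if value % 2 == 0 else -1

    return h

def f2_estimate(stream, t, rng):
    """Average of t squared counters, one counter per independently drawn hash."""
    hashes = [sample_sign_hash(rng) for _ in range(t)]
    z = [0] * t
    for j in stream:
        for i, h in enumerate(hashes):
            z[i] += h(j)
    return sum(zi * zi for zi in z) / t
```

The state kept between stream elements is just the $t$ counters and $4t$ coefficients, which is the point of replacing a totally random function by a short description.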
Putting it all together, we get a space bound of

\[ O\Big(\underbrace{\frac{1}{\epsilon^2 \delta}}_{\#\text{ of copies}} \cdot \big(\underbrace{\log m}_{\text{counter}} + \underbrace{\log n}_{\text{hash function}}\big)\Big). \tag{1.9} \]


1.4.3 Further Optimizations 


The bound in (1.9) is worse than that claimed in Theorem 1.1, with a dependence on $\frac{1}{\delta}$
instead of $\log \frac{1}{\delta}$. A simple trick yields the better bound. In Section 1.4.1 we averaged $t$
copies of the basic estimator to accomplish two conceptually different things: to improve the
approximation ratio to $(1 \pm \epsilon)$, for which we suffered a $\frac{1}{\epsilon^2}$ factor, and to improve the success
probability to $1 - \delta$, for which we suffered an additional $\frac{1}{\delta}$. It is more efficient to implement
these improvements one at a time, rather than in one shot. The smarter implementation
first uses $\approx \frac{1}{\epsilon^2}$ copies to obtain an approximation of $(1 \pm \epsilon)$ with probability at least $\frac{2}{3}$ (say).
To boost the success probability from $\frac{2}{3}$ to $1 - \delta$, it is enough to run $\approx \log \frac{1}{\delta}$ different copies
of this solution, and then take the median of their $\approx \log \frac{1}{\delta}$ different estimates. Since we
expect at least two-thirds of these estimates to lie in the interval $(1 \pm \epsilon) F_2$, it is very likely
that the median of them lies in this interval. The details are easily made precise using a
Chernoff bound argument; see the Exercises for details.
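The two-stage aggregation described here (average within groups, then take a median across groups) might be sketched as follows. The function name and the convention of consecutive, equal-size groups are assumptions made for illustration; the notes leave the precise parameters to the Exercises.

```python
import statistics

def median_of_means(estimates, group_size):
    """Average each consecutive group of estimates, then return the median of the averages."""
    means = [
        sum(estimates[i:i + group_size]) / group_size
        for i in range(0, len(estimates) - group_size + 1, group_size)
    ]
    return statistics.median(means)
```

The median is what makes the second stage cheap: a few wildly wrong group averages cannot move it, whereas they could drag a plain average arbitrarily far.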

Second, believe it or not, the $\log m$ term in Theorem 1.1 can be improved to $\log \log m$.
The reason is that we don’t need to count the $Z_i$’s exactly, only approximately and with
high probability. This relaxed counting problem can be solved using Morris’s algorithm,
which can be implemented as a streaming algorithm that uses $O(\epsilon^{-2} \log \log m \log \frac{1}{\delta})$ space.
See the Exercises for further details.
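For orientation, here is a sketch of the basic version of Morris's counter for increment-only counting. The class name is hypothetical, and how to combine it with the signed counters $Z_i$ and with averaging to reach the stated bound is left, as in the notes, to the Exercises.

```python
import random

class MorrisCounter:
    """Approximate counter that stores only an exponent, about log log m bits."""

    def __init__(self, rng):
        self.rng = rng
        self.exponent = 0

    def increment(self):
        # Bump the exponent with probability 2^(-exponent).
        if self.rng.random() < 2.0 ** (-self.exponent):
            self.exponent += 1

    def estimate(self):
        # 2^exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1
```

A single counter has high variance; as with the $F_2$ estimator, independent copies are averaged to sharpen the estimate.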


1.5 Estimating $F_0$: The High-Order Bit

Recall that $F_0$ denotes the number of distinct elements present in a data stream. The
high-level idea of the $F_0$ estimator is the same as for the $F_2$ estimator above. The steps are to
define a basic estimator that is essentially unbiased, and then reduce the variance by taking



shows up in the data stream. 


rather, a simple hash function with the salient properties of a random permutation. 
Why use the minimum? One intuition comes from the suggestive match between the
idempotence of $F_0$ and of the minimum: adding duplicate copies of an element to the
input has no effect on the answer.

Given the minimum $h(x)$-value in the data stream, how do we extract from it an estimate
of $F_0$, the number of distinct elements? For intuition, think about the uniform distribution
on $[0, 1]$ (Figure 1.1). Obviously, the expected value of one draw from the distribution is $\frac{1}{2}$.
For two i.i.d. draws, simple calculations show that the expected minimum and maximum
are $\frac{1}{3}$ and $\frac{2}{3}$, respectively. More generally, the expected order statistics of $k$ i.i.d. draws split
the interval into $k + 1$ segments of equal length. In particular, the expected minimum is $\frac{1}{k+1}$.
In other words, if you are told that some number of i.i.d. draws were taken from the
uniform distribution on $[0, 1]$ and the smallest draw was $c$, you might guess that there were
roughly $1/c$ draws.
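The value of the expected minimum can be verified with a short calculation, using only independence of the $k$ draws and the tail-integral formula for the expectation of a non-negative random variable:

```latex
\[
\Pr\Big[\min_{1 \le i \le k} D_i > x\Big] = (1 - x)^k \quad \text{for } x \in [0, 1],
\qquad
\mathbf{E}\Big[\min_{1 \le i \le k} D_i\Big]
  = \int_0^1 (1 - x)^k \, dx
  = \frac{1}{k + 1}.
\]
% D_1, ..., D_k denote the i.i.d. uniform draws (notation introduced here for illustration).
% Check: k = 1 gives 1/2 and k = 2 gives 1/3.
```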


Figure 1.1 The expected order statistics of i.i.d. samples from the uniform distribution on the unit
interval are spaced out evenly over the interval.


Translating this idea to our basic $F_0$ estimator, if there are $k$ distinct elements in
a data stream $x_1, \ldots, x_m$, then there are $k$ different (random) hash values $h(x_i)$, and
we expect the smallest of these to be roughly $|U|/k$. This leads to the basic estimator
$X = |U| / (\min_{i=1}^{m} h(x_i))$. Using averaging and an extra idea to reduce the variance, and
medians to boost the success probability, this leads to the bound claimed in Theorem 1.1
(without the $\log m$ term). The details are outlined in the exercises.
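The basic estimator alone might be sketched as below. Several things here are assumptions for illustration: the hash is a lazily sampled totally random function into $\{1, \ldots, |U|\}$ (so the sketch does not achieve small space), the function name is hypothetical, and the variance-reduction ideas mentioned above are omitted.

```python
import random

def basic_f0_estimate(stream, universe_size, rng):
    """|U| divided by the smallest hash value seen in the stream."""
    h = {}                      # lazily sampled random hash values, one per distinct element
    smallest = universe_size
    for x in stream:
        if x not in h:
            h[x] = rng.randint(1, universe_size)
        # Repeated elements reuse the same hash value, so duplicates never change the minimum.
        smallest = min(smallest, h[x])
    return universe_size / smallest
```

Feeding the same distinct elements with extra duplicates yields the same estimate, which is the idempotence property that motivates using the minimum.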

Remark 1.7 You’d be right to ask if this high-level approach to probabilistic and approximate
estimation applies to all of the frequency moments, not just $F_0$ and $F_2$. The approach
can indeed be used to estimate $F_k$ for all $k$. However, the variance of the basic estimators
will be different for different frequency moments. For example, as $k$ grows, the statistic $F_k$
becomes quite sensitive to small changes in the input, resulting in probabilistic estimators
with large variance, necessitating a large number of independent copies to obtain a good
approximation. More generally, no frequency moment $F_k$ with $k \notin \{0, 1, 2\}$ can be computed
using only a logarithmic amount of space (more details to come).







1.6 Can We Do Better? 

Theorem 1.1 is a fantastic result. But a good algorithm designer is never satisfied, and
always wants more. So what are the weaknesses of the upper bounds that we’ve proved so 
far? 

1. We only have interesting positive results for $F_0$ and $F_2$ (and maybe $F_1$, if you want to
count that). What about for $k > 2$ and $k = \infty$?

2. Our $F_0$ and $F_2$ algorithms only approximate the corresponding frequency moment.
Can we compute it exactly, possibly using a randomized algorithm?

3. Our $F_0$ and $F_2$ algorithms are randomized, and with probability $\delta$ fail to provide a
good approximation. (Also, they are Monte Carlo algorithms, in that we can’t tell
when they fail.) Can we compute $F_0$ or $F_2$ deterministically, at least approximately?

4. Our $F_0$ and $F_2$ algorithms use $\Omega(\log n)$ space. Can we reduce the dependency of the
space on the universe size?^12

5. Our $F_0$ and $F_2$ algorithms use $\Omega(\epsilon^{-2})$ space. Can the dependence on $\epsilon^{-1}$ be improved?
The $\epsilon^{-2}$ dependence can be painful in practice, where you might want to take $\epsilon = .01$,
resulting in an extra factor of 10,000 in the space bound. An improvement to $\approx \epsilon^{-1}$,
for example, would be really nice.

Unfortunately, we can’t do better: the rest of this lecture and the next (and the exercises)
explain why all of these compromises are necessary for positive results. This is kind of
amazing, and it’s also pretty amazing that we can prove it without overly heavy machinery.
Try to think of other basic computational problems where, in a couple hours of lecture
and with minimal background, you can explain complete proofs of both a non-trivial upper
bound and an unconditional (independent of P vs. NP, etc.) matching lower bound.^13


1.7 One-Way Communication Complexity 

We next describe a simple and clean formalism that is extremely useful for proving lower 
bounds on the space required by streaming algorithms to perform various tasks. The model 
will be a quite restricted form of the general communication model that we study later 
— and this is good for us, because the restriction makes it easier to prove lower bounds. 
Happily, even lower bounds for this restricted model typically translate to lower bounds for 
streaming algorithms. 

* This might seem like a long shot, but you never know. Recall our comment about reducing the
space dependency on m from O(log m) to O(log log m) via probabilistic approximate counters.

† OK, comparison-based sorting, sure. And we’ll see a couple of others later in this course.
But I don’t know of that many examples!

In general, communication complexity is a sweet spot. It is a general enough concept to
capture the essential hardness lurking in many different models of computation, as we’ll see
throughout the course. At the same time, it is possible to prove numerous different lower
bounds in the model; some of these require a lot of work, but many of the most important
ones are easier than you might have guessed. These lower bounds are “unconditional”:
they are simply true, and don’t depend on any unproven (if widely believed) conjectures like
P ≠ NP. Finally, because the model is so clean and free of distractions, it naturally guides
one toward the development of the “right” mathematical techniques needed for proving new
lower bounds.

In (two-party) communication complexity, there are two parties, Alice and Bob. Alice
has an input x ∈ {0,1}^a, Bob an input y ∈ {0,1}^b. Neither one has any idea what the
other’s input is. Alice and Bob want to cooperate to compute a Boolean function (i.e., a
predicate) f : {0,1}^a × {0,1}^b → {0,1} that is defined on their joint input. We’ll discuss
several examples of such functions shortly.

For this lecture and the next, we can get away with restricting attention to one-way 
communication protocols. All that is allowed here is the following: 


1. Alice sends Bob a message z, which is a function of her input x only. 

2. Bob declares the output f(x, y), as a function of Alice’s message z and his input y
only.


Since we’re interested in both deterministic and randomized algorithms, we’ll discuss both 
deterministic and randomized one-way communication protocols. 

The one-way communication complexity of a Boolean function f is the minimum worst-case
number of bits used by any one-way protocol that correctly decides the function. (Or
for randomized protocols, that correctly decides it with probability at least 2/3.) That is, it
is

    min_P max_{x,y} {length (in bits) of Alice’s message z when Alice’s input is x},

where the minimum ranges over all correct protocols P.
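
As a rough illustration of this definition, the Python sketch below represents a one-way
protocol as a pair of functions (one for Alice, one for Bob) and measures its cost as the
worst-case length of Alice's message. The function names, and the choice of the trivial
protocol and a one-bit parity protocol as examples, are illustrative assumptions rather than
notation from these notes.

```python
from itertools import product

def trivial_alice(x):
    """Trivial protocol: Alice sends her whole input (a bits)."""
    return tuple(x)

def parity_alice(x):
    """Alice sends a single bit: the parity of her input."""
    return (sum(x) % 2,)

def parity_bob(z, y):
    """Bob combines Alice's one-bit message with the parity of his own input."""
    return (z[0] + sum(y)) % 2

def worst_case_cost(alice, a):
    """Worst-case message length (in bits) over all a-bit inputs x."""
    return max(len(alice(x)) for x in product((0, 1), repeat=a))

def parity_protocol_is_correct(a, b):
    """Check the one-bit protocol against the parity of (x, y) on every input."""
    return all(
        parity_bob(parity_alice(x), y) == (sum(x) + sum(y)) % 2
        for x in product((0, 1), repeat=a)
        for y in product((0, 1), repeat=b)
    )
```

The cost depends only on Alice's side, which matches the definition: Bob's computation is
free, and only the bits that cross from Alice to Bob are counted.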

Note that the one-way communication complexity of a function f is always at most a,
since Alice can just send her entire a-bit input x to Bob, who can then certainly correctly
compute f. The question is to understand when one can do better. This will depend on
the specific function f. For example, if f is the parity function (i.e., decide whether the
total number of 1s in (x, y) is even or odd), then the one-way communication complexity of
f is 1 (Alice just sends the parity of x to Bob, who’s then in a position to figure out the
parity of (x, y)).


1.8 Connection to Streaming Algorithms

If you care about streaming algorithms, then you should also care about one-way
communication complexity. Why? Because of the unreasonable effectiveness of the following
two-step plan for proving lower bounds on the space usage of streaming algorithms.

1. Small-space streaming algorithms imply low-communication one-way protocols. 

2. The latter don’t exist. 

Both steps of this plan are quite doable in many cases. 

Does the connection in the first step above surprise you? It’s the best kind of statement:
genius and near-trivial at the same time. We’ll be formal about it shortly, but it’s worth
remembering a cartoon meta-version of the connection, illustrated in Figure 1.2. Consider
a problem that can be solved using a streaming algorithm S that uses space only s. How
can we use it to define a low-communication protocol? The idea is for Alice and Bob to
treat their inputs as a stream (x, y), with all of x arriving before all of y. Alice can feed
x into S without communicating with Bob (she knows x and S). After processing x, S’s
state is completely summarized by the s bits in its memory. Alice sends these bits to Bob.
Bob can then simply restart the streaming algorithm S seeded with this initial memory,
and then feed his input y to the algorithm. The algorithm S winds up computing some
function of (x, y), and Alice only needs to communicate s bits to Bob to make it happen.
The communication cost of the induced protocol is exactly the same as the space used by
the streaming algorithm.
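
The cartoon can be written down in a few lines of Python. This is only a sketch under
stated assumptions: `LengthCounter` is a made-up toy streaming algorithm whose whole memory
is one integer, and `induced_one_way_protocol` is an illustrative name. The point is just
that the only thing passed from Alice's half to Bob's half is the memory of S.

```python
class LengthCounter:
    """Toy stand-in for S: its entire memory is the number of items seen so far."""

    def __init__(self, memory=0):
        self.memory = memory

    def process(self, item):
        self.memory += 1

    def output(self):
        return self.memory


def induced_one_way_protocol(make_algorithm, alice_stream, bob_stream):
    """Run S on Alice's items, hand its memory to Bob, and let Bob finish the run."""
    # Alice's side: she knows S and her own input, so this needs no communication.
    alg = make_algorithm()
    for item in alice_stream:
        alg.process(item)
    message = alg.memory  # the single message: S's memory after processing x

    # Bob's side: restart S seeded with the received memory, then feed in y.
    resumed = make_algorithm(message)
    for item in bob_stream:
        resumed.process(item)
    return resumed.output(), message
```

For example, `induced_one_way_protocol(LengthCounter, [3, 1, 4], [1, 5])` returns the same
output as running `LengthCounter` on the concatenated stream, together with the message
that Alice sent.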



Figure 1.2 Why a small-space streaming algorithm induces a low-communication one-way protocol. 
Alice runs the streaming algorithm on her input, sends the memory contents of the algorithm to 
Bob, and Bob resumes the execution of the algorithm where Alice left off on his input. 


1.9 The Disjointness Problem 

To execute the two-step plan above to prove lower bounds on the space usage of streaming
algorithms, we need to come up with a Boolean function that (i) can be reduced to a
streaming problem that we care about and (ii) does not admit a low-communication one-way
protocol.


1.9.1 Disjointness Is Hard for One-Way Communication 

If you only remember one problem that is hard for communication protocols, it should
be the Disjointness problem. This is the canonical hard problem in communication
complexity, analogous to satisfiability (SAT) in the theory of NP-completeness. We’ll see
more reductions from the Disjointness problem than from any other in this course.

In an instance of Disjointness, both Alice and Bob hold n-bit vectors x and y. We
interpret these as characteristic vectors of two subsets of the universe {1, 2, ..., n}, with the
subsets corresponding to the “1” coordinates. We then define the Boolean function DISJ in
the obvious way, with DISJ(x, y) = 0 if there is an index i ∈ {1, 2, ..., n} with x_i = y_i = 1,
and DISJ(x, y) = 1 otherwise.

To warm up, let’s start with an easy result. 

Proposition 1.8 Every deterministic one-way communication protocol that computes the 
function DISJ uses at least n bits of communication in the worst case. 

That is, the trivial protocol is optimal among deterministic protocols.* The proof follows
pretty straightforwardly from the Pigeonhole Principle; you might want to think it through
before reading the proof below.

Formally, consider any one-way communication protocol where Alice always sends at
most n − 1 bits. This means that, ranging over the 2^n possible inputs x that Alice might
have, she sends fewer than 2^n distinct messages. By the Pigeonhole Principle, there are
distinct inputs x^1 and x^2 for which Alice sends the same message z to Bob. Poor Bob, then,
has to compute DISJ(x, y) knowing only z and y and not knowing x; x could be x^1, or it
could be x^2. Letting i denote an index in which x^1 and x^2 differ (there must be one), Bob is
really in trouble if his input y happens to be the ith basis vector (all zeroes except y_i = 1).
For then, whatever Bob says upon receiving the message z, he will be wrong for exactly one
of the cases x = x^1 or x = x^2. We conclude that the protocol is not correct.
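
The argument is constructive, and a short Python sketch can find the bad inputs for any
concrete message function. The names `disj` and `pigeonhole_witness` are illustrative, and
the example message function in the usage note (drop the last bit of x) is an assumption
chosen only because it uses n − 1 bits.

```python
from itertools import product

def disj(x, y):
    """DISJ(x, y) = 0 if some index has x_i = y_i = 1, and 1 otherwise."""
    return 0 if any(a == 1 and b == 1 for a, b in zip(x, y)) else 1

def pigeonhole_witness(n, message):
    """Find inputs x1 != x2 with the same message, plus a y on which DISJ differs.

    `message` maps an n-bit tuple to a hashable message. Returns None if the
    message function happens to be injective (so no collision exists).
    """
    seen = {}
    for x in product((0, 1), repeat=n):
        z = message(x)
        if z in seen:
            x1 = seen[z]
            # x1 and x differ somewhere; let y be the basis vector at that index.
            i = next(j for j in range(n) if x1[j] != x[j])
            y = tuple(1 if j == i else 0 for j in range(n))
            return x1, x, y
        seen[z] = x
    return None
```

With `message = lambda x: x[:-1]`, the call `pigeonhole_witness(3, message)` returns two
inputs that Bob cannot tell apart and an input y for which the correct answers differ, so
any rule Bob uses is wrong on one of them.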

A stronger, and more useful, lower bound also holds. 

Theorem 1.9 Every randomized one-way protocol† that, for every input (x, y), correctly
decides the function DISJ with probability at least 2/3 uses Ω(n) communication in the worst
case.

* We’ll see later that the communication complexity remains n even when we allow general
communication protocols.

† There are different flavors of randomized protocols, such as “public-coin” vs. “private-coin”
versions. These distinctions won’t matter until next lecture, and we elaborate on them then.

The probability in Theorem 1.9 is over the coin flips performed by the protocol (there is no
randomness in the input, which is “worst-case”). There’s nothing special about the constant
2/3 in the statement of Theorem 1.9; it can be replaced by any constant strictly larger than
1/2.


Theorem 1.9 is certainly harder to prove than Proposition 1.8, but it’s not too bad;
we’ll kick off next lecture with a proof.‡ For the rest of this lecture, we’ll take Theorem 1.9
on faith and use it to derive lower bounds on the space needed by streaming algorithms.


1.9.2 Space Lower Bound for F_∞ (even with Randomization and Approximation)

Recall from Section 1.6 that the first weakness of Theorem 1.1 is that it applies only to F_0
and F_2 (and F_1 is easy). The next result shows that, assuming Theorem 1.9, there is no
sublinear-space algorithm for computing F_∞, even probabilistically and approximately.


Theorem 1.10 (Alon et al. 1999) Every randomized streaming algorithm that, for every
data stream of length m, computes F_∞ to within a (1 ± .2) factor with probability at least
2/3 uses space Ω(min{m, n}).


Theorem 1.10 rules out, in a strong sense, extending our upper bounds for F_0, F_1, F_2 to
all F_k. Thus, the different frequency moments vary widely in tractability in the streaming
model.§


‡ A more difficult and important result is that the communication complexity of Disjointness
remains Ω(n) even if we allow arbitrary (not necessarily one-way) randomized protocols. We’ll
use this stronger result several times later in the course. We’ll also briefly discuss the
proof in Section 4.3.4.

§ For finite k strictly larger than 2, the optimal space of a randomized (1 ± ε)-approximate
streaming algorithm turns out to be roughly Θ(n^{1-2/k}) (Bar-Yossef et al. 2002a; Chakrabarti
et al. 2003; Indyk and Woodruff 2005). See the exercises for a bit more about these problems.

Proof of Theorem 1.10: The proof simply implements the cartoon in Figure 1.2 with the
problems of computing F_∞ (in the streaming model) and Disjointness (in the one-way
communication model). In more detail, let S be a space-s streaming algorithm that for
every data stream, with probability at least 2/3, outputs an estimate in (1 ± .2)F_∞. Now
consider the following one-way communication protocol P for solving the Disjointness
problem (given an input (x, y)):

1. Alice feeds into S the indices i for which x_i = 1; the order can be arbitrary. Since
Alice knows S and x, this step requires no communication.

2. Alice sends S’s current memory state σ to Bob. Since S uses space s, this can be
communicated using s bits.

3. Bob resumes the streaming algorithm S with the memory state σ, and feeds into S
the indices i for which y_i = 1 (in arbitrary order).

4. Bob declares “disjoint” if and only if S’s final answer is at most 4/3.


To analyze this reduction, observe that the frequency of an index i ∈ {1, 2, ..., n} in
the data stream induced by (x, y) is 0 if x_i = y_i = 0, 1 if exactly one of x_i, y_i is 1, and 2 if
x_i = y_i = 1. Thus, F_∞ of this data stream is 2 if (x, y) is a “no” instance of Disjointness, and
is at most 1 otherwise. By assumption, for every “yes” (respectively, “no”) input (x, y), with
probability at least 2/3 the algorithm S outputs an estimate that is at most 1.2 (respectively,
at least 2/1.2); since 1.2 < 4/3 < 2/1.2, in this case the protocol P correctly decides the
input (x, y). Since P is a one-way protocol using s bits of communication, Theorem 1.9
implies that s = Ω(n). Since the data stream length m is at most 2n, this reduction also
rules out o(m)-space streaming algorithms for the problem. ■
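
The four steps of the protocol can be sketched in Python as follows. An exact,
dictionary-based frequency counter stands in for the hypothetical small-space algorithm S
(by the theorem, no genuinely small-space S exists), and the function names are
illustrative assumptions. What the sketch shows is the shape of the reduction and the
4/3 threshold, not a space-efficient algorithm.

```python
from collections import Counter
from itertools import product

def disj(x, y):
    """DISJ(x, y) = 0 if some index has x_i = y_i = 1, and 1 otherwise."""
    return 0 if any(a == 1 and b == 1 for a, b in zip(x, y)) else 1

def disjointness_via_f_infinity(x, y):
    """Decide Disjointness with one message, using a max-frequency computation."""
    # Step 1: Alice feeds in the indices i with x_i = 1.
    memory = Counter(i for i, bit in enumerate(x) if bit == 1)
    # Step 2: Alice sends the memory state to Bob (the only communication).
    message = dict(memory)
    # Step 3: Bob resumes from that state and feeds in the indices i with y_i = 1.
    resumed = Counter(message)
    resumed.update(i for i, bit in enumerate(y) if bit == 1)
    estimate = max(resumed.values(), default=0)  # exact F_infinity of the stream
    # Step 4: Bob declares "disjoint" (output 1) iff the answer is at most 4/3.
    return 1 if estimate <= 4 / 3 else 0

def reduction_agrees_with_disj(n):
    """Exhaustively compare the protocol's output with DISJ on all n-bit inputs."""
    return all(
        disjointness_via_f_infinity(x, y) == disj(x, y)
        for x in product((0, 1), repeat=n)
        for y in product((0, 1), repeat=n)
    )
```

Because the stand-in is exact, the protocol here is always correct; with a (1 ± .2)-approximate
S the same threshold works whenever S's estimate is within its guarantee.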


Remark 1.11 (The Heavy Hitters Problem) Theorem 1.10 implies that computing
the maximum frequency is a hard problem in the streaming model, at least for worst-case
inputs. As mentioned, the problem is nevertheless practically quite important, so it’s
important to make progress on it despite this lower bound. For example, consider the
following relaxed version, known as the “heavy hitters” problem: for a parameter k, if there
are any elements with frequency bigger than m/k, then find one or all such elements. When k
is constant, there are good solutions to this problem: the exercises outline the “Misra-Gries”
algorithm, and the “Count-Min Sketch” and its variants also give good solutions (Charikar
et al. 2004; Cormode and Muthukrishnan 2005).* The heavy hitters problem captures
many of the applications that motivated the problem of computing F_∞.


1.9.3 Space Lower Bound for Randomized Exact Computation of F_0 and F_2


In Section 1.6 we also criticized our positive results for F_0 and F_2: to achieve them, we
had to make two compromises, allowing approximation and a non-zero chance of failure.
The reduction in the proof of Theorem 1.10 also implies that merely allowing randomization
is not enough.


Theorem 1.12 (Alon et al. 1999) For every non-negative integer k ≠ 1, every randomized
streaming algorithm that, for every data stream, computes F_k exactly with probability
at least 2/3 uses space Ω(min{n, m}).


The proof of Theorem 1.12 is almost identical to that of Theorem 1.10. The reason the
proof of Theorem 1.10 rules out approximation (even with randomization) is that F_∞
differs by a factor of 2 in the two different cases (“yes” and “no” instances of Disjointness).
For finite k, the correct value of F_k will be at least slightly different in the two cases, which
is enough to rule out a randomized algorithm that is exact at least two-thirds of the time.†

The upshot of Theorem 1.12 is that, even for F_0 and F_2, approximation is essential
to obtain a sublinear-space algorithm. It turns out that randomization is also essential:
every deterministic streaming algorithm that always outputs a (1 ± ε)-estimate of F_k (for
any k ≠ 1) uses linear space (Alon et al. 1999). The argument is not overly difficult; see
the Exercises for the details.

* This does not contradict Theorem 1.10: in the hard instances of F_∞ produced by that proof,
all frequencies are in {0, 1, 2} and hence there are no heavy hitters.


1.10 Looking Backward and Forward 

Assuming that randomized one-way communication protocols require Ω(n) communication
to solve the Disjointness problem (Theorem 1.9), we proved that some frequency moments
(in particular, F_∞) cannot be computed in sublinear space, even allowing randomization
and approximation. Also, both randomization and approximation are essential for our
sublinear-space streaming algorithms for F_0 and F_2.

The next action items are: 


1. Prove Theorem 1.9.

2. Revisit the five compromises we made to obtain positive results (Section 1.6). We’ve
shown senses in which the first three compromises are necessary. Next lecture we’ll
see why the last two are needed, as well.


† Actually, this is not quite true (why?). But if Bob also knows the number of 1’s in Alice’s
input (which Alice can communicate in about log_2 n bits, a drop in the bucket), then the exact
computation of F_k allows Bob to distinguish “yes” and “no” inputs of Disjointness (for any k ≠ 1).












Lecture 2 


Lower Bounds for One-Way Communication: Disjointness, Index, and Gap-Hamming


2.1 The Story So Far 


Recall from last lecture the simple but useful model of one-way communication complexity.
Alice has an input x ∈ {0,1}^a, Bob has an input y ∈ {0,1}^b, and the goal is to compute
a Boolean function f : {0,1}^a × {0,1}^b → {0,1} of the joint input (x, y). The players
communicate as in Figure 2.1: Alice sends a message z to Bob as a function of x only (she
doesn’t know Bob’s input y), and Bob has to decide the function f knowing only z and
y (he doesn’t know Alice’s input x). The one-way communication complexity of f is the
smallest number of bits communicated (in the worst case over (x, y)) of any protocol that
computes f. We’ll sometimes consider deterministic protocols but are interested mostly in
randomized protocols, which we’ll define more formally shortly.



Figure 2.1 A one-way communication protocol. Alice sends a message to Bob that depends only 
on her input; Bob makes a decision based on his input and Alice’s message. 


We motivated the one-way communication model through applications to streaming
algorithms. Recall the data stream model, where a data stream x_1, ..., x_m ∈ U of elements
from a universe of n = |U| elements arrive one by one. The assumption is that there is
insufficient space to store all of the data, but we’d still like to compute useful statistics
of it via a one-pass computation. Last lecture, we showed that very cool and non-trivial
positive results are possible in this model. We presented a slick and low-space
(O(ε^{-2}(log n + log m) log(1/δ))) streaming algorithm that, with probability at least 1 − δ,
computes a (1 ± ε)-approximation of F_2 = Σ_{j ∈ U} f_j^2, which measures the skew of the data.
(Recall that f_j ∈ {0, 1, 2, ..., m} is the number of times that j appears in the stream.) We
also mentioned the main idea (details in the homework) for an analogous low-space streaming
algorithm that estimates F_0, the number of distinct elements in a data stream.

Low-space streaming algorithms S induce low-communication one-way protocols P, with
the communication used by P equal to the space used by S. Such reductions typically have
the following form. Alice converts her input x to a data stream and feeds it into the assumed
space-s streaming algorithm S. She then sends the memory of S (after processing x) to
Bob; this requires only s bits of communication. Bob then resumes S’s execution at the
point that Alice left off, and feeds a suitable representation of his input y into S. When
S terminates, it has computed some kind of useful function of (x, y) with only s bits of
communication. The point is that lower bounds for one-way communication protocols
(which, as we’ll see, we can actually prove in many cases) imply lower bounds on the space
needed by streaming algorithms.

Last lecture we used without proof the following result (Theorem 1.9).*


Theorem 2.1 The one-way communication complexity of the Disjointness problem is
Ω(n), even for randomized protocols.

We’ll be more precise about the randomized protocols that we consider in the next section.
Recall that an input of Disjointness is defined by x, y ∈ {0,1}^n, which we view as
characteristic vectors of two subsets of {1, 2, ..., n}, and the output should be “0” if there is
an index i with x_i = y_i = 1 and “1” otherwise.


We used Theorem 1.9 to prove a few lower bounds on the space required by streaming
algorithms. A simple reduction showed that every streaming algorithm that computes F_∞,
the maximum frequency, even approximately and with probability 2/3, needs linear (i.e.,
Ω(min{n, m})) space. This is in sharp contrast to our algorithms for approximating F_0 and
F_2, which required only logarithmic space. The same reduction proves that, for F_0 and F_2,
exact computation requires linear space, even if randomization is allowed. A different simple
argument (see the homework) shows that randomization is also essential for our positive
results: every deterministic streaming algorithm that approximates F_0 or F_2 up to a small
constant factor requires linear space.


In today’s lecture we’ll prove Theorem 1.9, introduce and prove lower bounds for a
couple of other problems that are hard for one-way communication, and prove via reductions
some further space lower bounds for streaming algorithms.

* Though we did prove it for the special case of deterministic protocols, using a simple
Pigeonhole Principle argument.


2.2 Randomized Protocols

There are many different flavors of randomized communication protocols. Before proving
any lower bounds, we need to be crystal clear about exactly which protocols we’re talking
about. The good news is that, for algorithmic applications, we can almost always focus
on a particular type of randomized protocols. By default, we adopt the following four
assumptions and rules of thumb. The common theme behind them is that we want to allow as
permissive a class of randomized protocols as possible, to maximize the strength of our
lower bounds and the consequent algorithmic applications.

Public coins. First, unless otherwise noted, we consider public-coin protocols. This
means that, before Alice and Bob ever show up, a deity writes an infinite sequence of
perfectly random bits on a blackboard visible to both Alice and Bob. Alice and Bob
can freely use as many of these random bits as they want; it doesn’t contribute to the
communication cost of the protocol.

The private coins model might seem more natural to the algorithm designer: here,
Alice and Bob just flip their own random coins as needed. Coins flipped by one player are
unknown to the other player unless they are explicitly communicated.† Note that every
private-coins protocol can be simulated with no loss by a public-coins protocol: for example,
Alice uses the shared random bits 1, 3, 5, etc. as needed, while Bob uses the random bits 2,
4, 6, etc.

It turns out that while public-coin protocols are strictly more powerful than private-coin 
protocols, for the purposes of this course, the two models have essentially the same behavior. 
In any case, our lower bounds will generally apply to public-coin (and hence also private-coin) 
protocols. 

A second convenient fact about public-coin randomized protocols is that they are
equivalent to distributions over deterministic protocols. Once the random bits on the
blackboard have been fixed, the protocol proceeds deterministically. Conversely, every
distribution over deterministic protocols (with rational probabilities) can be implemented
via a public-coin protocol: just use the public coins to choose from the distribution.

Two-sided error. We consider randomized algorithms that are allowed to err with
some probability on every input (x, y), whether f(x, y) = 0 or f(x, y) = 1. A stronger
requirement would be one-sided error; here there are two flavors, one that forbids false
positives (but allows false negatives) and one that forbids false negatives (but allows false
positives). Clearly, lower bounds that apply to protocols with two-sided error are at least
as strong as those for protocols with one-sided error; indeed, the latter lower bounds
are often much easier to prove (at least for one of the two sides). Note that the one-way
protocols induced by the streaming algorithms in the last lecture are randomized protocols
with two-sided error. There are other problems for which the natural randomized solutions
have only one-sided error.‡

† Observe that the one-way communication protocols induced by streaming algorithms are
private-coin protocols: random coins flipped during the first half of the data stream are only
available to the second half if they are explicitly stored in memory.

‡ One can also consider “zero-error” randomized protocols, which always output the correct
answer but use a random amount of communication. We won’t need to discuss such protocols in
this course.

Arbitrary constant error probability. A simple but important fact is that all
constant error probabilities ε ∈ (0, 1/2) yield the same communication complexity, up to a
constant factor. The reason is simple: the success probability of a protocol can be boosted
through amplification (i.e., repeated trials).§ In more detail, suppose P uses k bits of
communication and has success probability at least 51% on every input. Imagine repeating
P 10000 times. To preserve one-way-ness of the protocol, all of the repeated trials need to
happen in parallel, with the public coins providing the necessary 10000 independent random
strings. Alice sends 10000 messages to Bob, Bob imagines answering each one (some answers
will be “1,” others “0”) and concludes by reporting the majority vote of the 10000 answers. In
expectation at least 5100 of the trials give the correct answer, and the probability that more
than 5000 of them are correct is big (at least 90%, say). In general, a constant number of
trials, followed by a majority vote, boosts the success probability of a protocol from any
constant bigger than 1/2 to any other constant less than 1. These repeated trials increase the
amount of communication by only a constant factor. See the exercises and the separate notes on
Chernoff bounds for further details.
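
The “at least 90%, say” claim can be checked numerically. The sketch below sums the exact
binomial tail in log-space, which avoids overflow for 10000 trials; the function name is an
illustrative assumption, and the calculation treats each trial as correct independently with
probability exactly p, which is the worst case allowed by “at least 51%”.

```python
import math

def majority_success_probability(trials, p):
    """Probability that more than half of `trials` independent runs are correct,
    when each run is correct independently with probability p."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(trials // 2 + 1, trials + 1):
        # log of C(trials, k) * p^k * (1 - p)^(trials - k), computed via lgamma
        log_term = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * log_p
            + (trials - k) * log_q
        )
        total += math.exp(log_term)
    return total
```

Calling `majority_success_probability(10000, 0.51)` gives a value comfortably above 0.9,
while a single trial (`trials=1`) just returns p.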

This argument justifies being sloppy about the exact (constant) error of a two-sided
protocol. For upper bounds, we’ll be content to achieve error 49%; it can be reduced to
an arbitrarily small constant with a constant blow-up in communication. For lower bounds,
we’ll be content to rule out protocols with error 1%; the same communication lower
bounds hold, modulo a constant factor, even for protocols with error 49%.

Worst-case communication. When we speak of the communication used by a randomized
protocol, we take the worst case over inputs (x, y) and over the coin flips of the
protocol. So if a protocol uses communication at most k, then Alice always sends at most k
bits to Bob.

This definition seems to go against our guiding rule of being as permissive as possible.
Why not measure only the expected communication used by a protocol, with respect to its
coin flips? This objection is conceptually justified but technically moot: for protocols that
can err, passing to the technically more convenient worst-case measure can only increase
the communication complexity of a problem by a constant factor.

To see this, consider a protocol R that, for every input (x, y), has two-sided error at
most 1/3 (say) and uses at most k bits of communication on average over its coin flips. This
protocol uses at most 10k bits of communication at least 90% of the time; if it used more
than 10k bits more than 10% of the time, its expected communication cost would be more
than k. Now consider the following protocol R': simulate R for up to 10k steps; if R fails to
terminate, then abort and output an arbitrary answer. The protocol R' always sends at
most 10k bits of communication and has error at most that of R, plus 10% (here, ≈ 43%).
This error probability of R' can be reduced back down (to 1/3, or whatever) through repeated
trials, as before.
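
The two numerical claims in this argument are an application of Markov's inequality
followed by a union bound. Written out, with C denoting the (random) number of bits that R
communicates on a fixed input, a symbol introduced here only for this calculation:

```latex
% C = number of bits R communicates on a fixed input; randomness is over R's coins.
% Assumption from the text: E[C] <= k and Pr[R wrong] <= 1/3.
\[
  \Pr[C > 10k] \;\le\; \frac{\mathbb{E}[C]}{10k} \;\le\; \frac{k}{10k} \;=\; \frac{1}{10},
\]
\[
  \Pr[R' \text{ wrong}] \;\le\; \Pr[R \text{ wrong}] + \Pr[C > 10k]
  \;\le\; \frac{1}{3} + \frac{1}{10} \;=\; \frac{13}{30} \;\approx\; 43\%.
\]
```

Since 13/30 is a constant strictly below 1/2, the amplification argument above applies to R'.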


§ We mentioned a similar “median of means” idea last lecture (developed further in the
homework), when we discussed how to reduce the 1/δ factor in the space usage of our streaming
algorithms to a factor of log(1/δ).

In light of these four standing assumptions and rules, we can restate Theorem 1.9 as
follows.

Theorem 2.2 Every public-coin randomized one-way protocol for Disjointness that has
two-sided error at most a constant ε ∈ (0, 1/2) uses Ω(n) communication in the worst
case (over inputs and coin flips).

Now that we are clear on the formal statement of our lower bound, how do we prove it? 


2.3 Distributional Complexity 

Randomized protocols are much more of a pain to reason about than deterministic protocols.
For example, recall our Pigeonhole Principle-based argument last lecture for deterministic
protocols: if Alice holds an n-bit input and always sends at most n − 1 bits, then there are
distinct inputs x, x' such that Alice sends the same message z. (For Disjointness, this
ambiguity left Bob in the lurch.) In a randomized protocol where Alice always sends at most
n − 1 bits, Alice can use a different distribution over (n − 1)-bit messages for each of her 2^n
inputs x, and the naive argument breaks down. While Pigeonhole Principle-type arguments can
sometimes be pushed through for randomized protocols, this section introduces a different
approach.

Distributional complexity is the main methodology by which one proves lower bounds on 
the communication complexity of randomized algorithms. The point is to reduce the goal 
to proving lower bounds for deterministic protocols only, with respect to a suitably chosen 
input distribution. 


Lemma 2.3 (Yao 1983) Let D be a distribution over the space of inputs (x, y) to a
communication problem, and ε ∈ (0, 1/2). Suppose that every deterministic one-way protocol
P with

    Pr_{(x,y)~D}[P wrong on (x, y)] ≤ ε

has communication cost at least k. Then every (public-coin) randomized one-way protocol R
with (two-sided) error at most ε on every input has communication cost at least k.


In the hypothesis of Lemma 2.3, all of the randomness is in the input: P is deterministic,
(x, y) is random. In the conclusion, all of the randomness is in the protocol R: the input
is arbitrary but fixed, while the protocol can flip coins. Not only is Lemma 2.3 extremely
useful, but it is easy to prove.


Proof of Lemma 2.3: Let R be a randomized protocol with communication cost less than k.
Recall that such an R can be written as a distribution over deterministic protocols, call them
P_1, P_2, ..., P_s. Recalling that the communication cost of a randomized protocol is defined as
the worst-case communication (over both inputs and coin flips), each deterministic protocol
P_i always uses less than k bits of communication. By assumption,

    Pr_{(x,y)~D}[P_i wrong on (x, y)] > ε

for i = 1, 2, ..., s. Averaging over the P_i’s, we have

    Pr_{(x,y)~D; R}[R wrong on (x, y)] > ε.

Since the maximum of a set of numbers is at least its average, there exists an input (x, y)
such that

    Pr_R[R wrong on (x, y)] > ε,

which completes the proof. ■
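
The averaging step can be illustrated on a tiny numerical example. In the sketch below,
`err[i][j]` is 1 if deterministic protocol P_i is wrong on input j and 0 otherwise; a
randomized protocol is a probability vector over the rows, and D is a probability vector
over the columns. The matrix and the function names are made up for illustration.

```python
def randomized_worst_case_error(err, protocol_probs):
    """Max over inputs of the probability (over the choice of P_i) of being wrong."""
    num_inputs = len(err[0])
    return max(
        sum(q * err[i][j] for i, q in enumerate(protocol_probs))
        for j in range(num_inputs)
    )

def best_distributional_error(err, input_probs):
    """Min over deterministic protocols of the error probability under D."""
    return min(
        sum(d * e for d, e in zip(input_probs, row))
        for row in err
    )
```

For every choice of the two probability vectors, the first quantity is at least the second:
the worst input is at least as bad as a D-average input, and the average over the P_i is at
least as large as the best single P_i. For instance, with
`err = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, protocol weights `[0.5, 0.25, 0.25]`, and the
uniform D, the two values are 1/2 and 1/3.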

The converse of Lemma 2.3 also holds: whatever the true randomized communication
complexity of a problem, there exists a bad distribution D over inputs that proves it (Yao
1983). The proof is by strong linear programming duality or, equivalently, von Neumann’s
Minimax Theorem for zero-sum games (see the exercises for details). Thus, the distributional
methodology is “complete” for proving lower bounds; one “only” needs to find the right
distribution D over inputs. In general this is a bit of a dark art, though in today’s application
D will just be the uniform distribution.


2.4 The Index Problem 


We prove Theorem 2.2 in two steps. The first step is to prove a linear lower bound on the
randomized communication complexity of a problem called Index, which is widely useful
for proving one-way communication complexity lower bounds. The second step, which is
easy, reduces Index to Disjointness.

In an instance of Index, Alice gets an n-bit string x ∈ {0,1}^n and Bob gets an integer
i ∈ {1, 2, ..., n}, encoded in binary using ≈ log_2 n bits. The goal is simply to compute x_i,
the ith bit of Alice’s input.

Intuitively, since Alice has no idea which of her bits Bob is interested in, she has to send 
Bob her entire input. This intuition is easy to make precise for deterministic protocols, by a 
Pigeonhole Principle argument. The intuition also holds for randomized protocols, but the 
proof takes more work. 


Theorem 2.4 (Kremer et al. 1999) The randomized one-way communication complexity
of Index is Ω(n).

With a general communication protocol, where Bob can also send information to Alice,
Index is trivial to solve using only ≈ log_2 n bits of information: Bob just sends i to Alice.
Thus Index nicely captures the difficulty of designing non-trivial one-way communication
protocols, above and beyond the lower bounds that already apply to general protocols.


Theorem 2.4 easily implies Theorem 2.2.


Proof of Theorem 2.2: We show that Index reduces to Disjointness. Given an input (x, i) of Index, Alice forms the input x' = x while Bob forms the input y' = e_i; here e_i is the standard basis vector, with a “1” in the ith coordinate and “0”s in all other coordinates. Then, (x', y') is a “yes” instance of Disjointness if and only if x_i = 0. Thus, every one-way protocol for Disjointness induces one for Index, with the same communication cost and error probability. ■
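The reduction is short enough to sketch in code. In this illustration the function `disjoint` is a stand-in oracle for an arbitrary Disjointness protocol (an assumption of the sketch, not a communication protocol in its own right), and indices are 1-based as in the text.

```python
def disjoint(x, y):
    """Return 1 ("yes") if the sets with characteristic vectors x and y are disjoint."""
    return int(all(not (a and b) for a, b in zip(x, y)))

def index_via_disjointness(x, i):
    """Solve Index on (x, i) with a single call to the Disjointness oracle."""
    n = len(x)
    e_i = [1 if j == i else 0 for j in range(1, n + 1)]   # Bob's input y' = e_i
    return 1 - disjoint(x, e_i)                           # (x, e_i) is disjoint iff x_i = 0
```

Alice’s input is passed through unchanged, so the message she would send in a Disjointness protocol is exactly the message she sends in the induced Index protocol.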


We now prove Theorem 2.4. While some computations are required, the proof is conceptually pretty straightforward.


Proof of Theorem 2.4: We apply the distributional complexity methodology. This requires positing a distribution D over inputs. Sometimes this takes creativity. Here, the first thing you’d try (the uniform distribution D, where x and i are chosen independently and uniformly at random) works.

Let c be a sufficiently small constant (like 0.1 or less) and assume that n is sufficiently large (like 300 or more). We’ll show that every deterministic one-way protocol that uses at most cn bits of communication has error (w.r.t. D) at least 1/8. By Lemma 2.3, this implies that every randomized protocol that uses at most cn bits has error at least 1/8 on some input. Recalling the discussion about error probabilities in Section 2.2, this implies that for every error ε' > 0, there is a constant c' > 0 such that every randomized protocol that uses at most c'n bits of communication has error bigger than ε'.

Fix a deterministic one-way protocol P that uses at most cn bits of communication. Since P is deterministic, there are at most 2^{cn} distinct messages z that Alice ever sends to Bob (ranging over the 2^n possible inputs x). We need to formalize the intuition that Bob typically (over x) doesn’t learn very much about x, and hence typically (over i) doesn’t know what x_i is.

Suppose Bob gets a message z from Alice, and his input is i. Since P is deterministic, Bob has to announce a bit, “0” or “1,” as a function of z and i only. (Recall Figure 2.1.) Holding z fixed and considering Bob’s answers for each of his possible inputs i = 1, 2, ..., n, we get an n-bit vector: Bob’s answer vector a(z) when he receives message z from Alice. Since there are at most 2^{cn} possible messages z, there are at most 2^{cn} possible answer vectors a(z).

Answer vectors are a convenient way to express the error of the protocol P, with respect to the randomness in Bob’s input. Fix Alice’s input x, which results in the message z. The protocol is correct if Bob holds an input i with a(z)_i = x_i, and incorrect otherwise. Since Bob’s index i is chosen uniformly at random, and independently of x, we have

\[ \Pr_i[P \text{ is incorrect} \mid x, z] = \frac{d_H(x, a(z))}{n}, \tag{2.1} \]

where d_H(x, a(z)) denotes the Hamming distance between the vectors x and a(z) (i.e., the number of coordinates in which they differ). Our goal is to show that, with constant probability over the choice of x, the expression (2.1) is bounded below by a constant.
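To make identity (2.1) concrete, here is a minimal sketch for one hypothetical deterministic protocol: Alice sends the first k bits of x, and Bob echoes the matching bit when he can and guesses 0 otherwise. The protocol and the helper names are illustrative assumptions; any deterministic one-way protocol would do.

```python
from fractions import Fraction

def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def answer_vector(z, n):
    """Bob's outputs for i = 1..n after receiving message z in the toy protocol."""
    return list(z) + [0] * (n - len(z))

def error_two_ways(x, k):
    """Pr_i[P incorrect | x], computed by direct counting and via (2.1)."""
    n = len(x)
    a = answer_vector(x[:k], n)                       # Alice's message is z = x[:k]
    direct = Fraction(sum(a[i] != x[i] for i in range(n)), n)
    via_formula = Fraction(hamming(x, a), n)
    return direct, via_formula
```

For x = 10110111 and k = 3, Bob is wrong on four of the eight indices, and both computations return 1/2.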

Let A = {a(z(x)) : x ∈ {0,1}^n} denote the set of all answer vectors used by the protocol P. Recall that |A| ≤ 2^{cn}. Call Alice’s input x good if there exists an answer vector a ∈ A with d_H(x, a) < n/4, and bad otherwise. Geometrically, you should think of each answer vector a as the center of a ball of radius n/4 in the Hamming cube (the set {0,1}^n equipped with the Hamming metric). See Figure 2.2. The next claim states that, because there aren’t too many balls (only 2^{cn} for a small constant c) and their radii aren’t too big (only n/4), the union of all of the balls is less than half of the Hamming cube.*



Figure 2.2 Balls of radius n/4 in the Hamming metric, centered at the answer vectors used by the 
protocol P. 


Claim: Provided c is sufficiently small and n is sufficiently large, there are at least 2^{n-1} bad inputs x.


*More generally, the following is good intuition about the Hamming cube for large n: as you blow up a ball of radius r around a point, the ball includes very few points until r is almost equal to n/2; the ball includes roughly half the points for r ≈ n/2; and for r even modestly larger than n/2, the ball contains almost all of the points.
Before proving the claim, let’s see why it implies the theorem. We can write

\[
\Pr_{(x,i) \sim D}[P \text{ wrong on } (x,i)]
= \underbrace{\Pr[x \text{ is good}] \cdot \Pr[P \text{ wrong on } (x,i) \mid x \text{ is good}]}_{\ge 0}
+ \underbrace{\Pr[x \text{ is bad}]}_{\ge 1/2 \text{ by Claim}} \cdot \Pr[P \text{ wrong on } (x,i) \mid x \text{ is bad}].
\]

Recalling (2.1) and the definition of a bad input x, we have

\[
\Pr_{(x,i)}[P \text{ wrong on } (x,i) \mid x \text{ is bad}]
= \mathbf{E}_x\left[ \frac{d_H(x, a(z(x)))}{n} \,\middle|\, x \text{ is bad} \right]
\ge \mathbf{E}_x\Bigl[ \underbrace{\min_{a \in A} \frac{d_H(x, a)}{n}}_{\ge 1/4 \text{ since } x \text{ is bad}} \Bigm| x \text{ is bad} \Bigr]
\ge \frac{1}{4}.
\]

We conclude that the protocol P errs on the distribution D with probability at least 1/8, which implies the theorem. We conclude by proving the claim.

Proof of Claim: Fix some answer vector a ∈ A. The number of inputs x with Hamming distance at most n/4 from a is

\[
\underbrace{1}_{a} + \underbrace{\binom{n}{1}}_{d_H(x,a)=1} + \underbrace{\binom{n}{2}}_{d_H(x,a)=2} + \cdots + \underbrace{\binom{n}{n/4}}_{d_H(x,a)=n/4}. \tag{2.2}
\]

Recalling the inequality

\[ \binom{n}{k} \le \left( \frac{en}{k} \right)^{k}, \]

which follows easily from Stirling’s approximation of the factorial function (see the exercises), we can crudely bound (2.2) above by

\[ n (4e)^{n/4} = n\, 2^{\log_2(4e) \cdot n/4} \le n\, 2^{0.861 n}. \]

The total number of good inputs x (the union of all the balls) is at most |A| · n 2^{0.861n} ≤ n 2^{(0.861+c)n}, which is at most 2^{n-1} for c sufficiently small (say 0.1) and n sufficiently large (at least 300, say). ■
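The constants in the proof of the claim are easy to sanity-check numerically. The sketch below (variable names are ours) compares the exact size of a Hamming ball of radius n/4 with the crude bound and verifies the final inequality for the suggested values c = 0.1 and n = 300.

```python
from math import comb, e, log2

def ball_size(n, r):
    """Exact number of points of {0,1}^n within Hamming distance r of a fixed point."""
    return sum(comb(n, j) for j in range(r + 1))

n, c = 300, 0.1
exact = ball_size(n, n // 4)             # the sum (2.2)
crude = n * (4 * e) ** (n / 4)           # the crude bound n(4e)^{n/4}
exponent = log2(4 * e) / 4               # the constant rounded to 0.861 in the text

# log2 of the bound n * 2^{(exponent + c) n} on the number of good inputs
log2_good_bound = log2(n) + (exponent + c) * n
```

With these values the bound on the number of good inputs stays below 2^{n-1}, so at least half of the inputs are bad.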
2.5 Where We’re Going


Theorem 2.4 completes our first approach to proving lower bounds on the space required by streaming algorithms to compute certain statistics. To review, we proved from scratch that Index is hard for one-way communication protocols (Theorem 2.4), reduced Index to Disjointness to extend the lower bound to the latter problem (Theorem 2.2), and reduced Disjointness to various streaming computations (last lecture). See also Figure 2.3. Specifically, we showed that linear space is necessary to compute the highest frequency in a data stream (F_∞), even when randomization and approximation are allowed, and that linear space is necessary to compute exactly F_0 or F_2 by a randomized streaming algorithm with success probability 2/3.


Index (hard by Theorem 2.4) --[Theorem 2.2]--> Disjointness --[Lecture 1]--> Streaming

Figure 2.3 Review of the proof structure of linear (in min{n, m}) space lower bounds for streaming algorithms. Lower bounds travel from left to right.


We next focus on the dependence on the approximation parameter ε required by a streaming algorithm to compute a (1 ± ε)-approximation of a frequency moment. Recall that the streaming algorithms that we’ve seen for F_0 and F_2 have quadratic dependence on ε^{-1}. Thus an approximation of 1% would require a blowup of 10,000 in the space. Obviously, it would be useful to have algorithms with a smaller dependence on ε^{-1}. We next prove that space quadratic in ε^{-1} is necessary, even allowing randomization and even for F_0 and F_2, to achieve a (1 ± ε)-approximation.


Happily, we’ll prove this via reductions, and won’t need to prove from scratch any new communication lower bounds. We’ll follow the path in Figure 2.4. First we introduce a new problem, also very useful for proving lower bounds, called the Gap-Hamming problem. Second, we give a quite clever reduction from Index to Gap-Hamming. Finally, it is straightforward to show that streaming algorithms that compute a (1 ± ε)-approximation of F_0 or F_2 in o(ε^{-2}) space induce one-way protocols for Gap-Hamming with sublinear communication.


Index (hard by Theorem 2.4) --[Theorem 2.5]--> Gap-Hamming --[Section 2.6.2]--> Streaming

Figure 2.4 Proof plan for Ω(ε^{-2}) space lower bounds for (randomized) streaming algorithms that approximate F_0 or F_2 up to a 1 ± ε factor. Lower bounds travel from left to right.
2.6 The Gap-Hamming Problem

Our current goal is to prove that every streaming algorithm that computes a (1 ± ε)-approximation of F_0 or F_2 needs Ω(ε^{-2}) space. Note that we’re not going to prove this when ε ≪ 1/√n, since we can always compute a frequency moment exactly in linear or near-linear space. So the extreme case of what we’re trying to prove is that a (1 ± 1/√n)-approximation requires Ω(n) space. This special case already requires all of the ideas needed to prove a lower bound of Ω(ε^{-2}) for all larger ε as well.

2.6.1 Why Disjointness Doesn’t Work 

Our goal is also to prove this lower bound through reductions, rather than from scratch. 
We don’t know too many hard problems yet, and we’ll need a new one. To motivate it, let’s 
see why Disjointness is not good enough for our purposes. 

Suppose we have a streaming algorithm S that gives a (1 ± 1/√n)-approximation of F_0. How could we use it to solve Disjointness? The obvious idea is to follow the reduction used last lecture for F_∞. Alice converts her input x of Disjointness to a stream, feeds this stream into S, and sends the final memory state of S to Bob; Bob converts his input y of Disjointness into a stream and resumes S’s computation on it. With healthy probability, S returns a (1 ± 1/√n)-approximation of F_0 of the stream induced by (x, y). But is this good for anything?

Suppose (x, y) is a “yes” instance of Disjointness. Then, F_0 of the corresponding stream is |x| + |y|, where | · | denotes the number of 1’s in a bit vector. If (x, y) is a “no” instance of Disjointness, then F_0 is somewhere between max{|x|, |y|} and |x| + |y| − 1. A particularly hard case is when |x| = |y| = n/2 and x, y are either disjoint or overlap in exactly one element, so that F_0 is either n or n − 1. In this case, a (1 ± 1/√n)-approximation of F_0 translates to additive error √n, which is nowhere near enough resolution to distinguish between “yes” and “no” instances of Disjointness.

2.6.2 Reducing Gap-Hamming to F_0 Estimation

A (1 ± 1/√n)-approximation of F_0 is insufficient to solve Disjointness, but perhaps there is some other hard problem that it does solve? The answer is yes, and the problem is estimating the Hamming distance between two vectors x, y (the number of coordinates in which x, y differ).

To see the connection between F_0 and Hamming distance, consider x, y ∈ {0,1}^n and the usual data stream (with elements in U = {1, 2, ..., n}) induced by them. As usual, we can interpret x, y as characteristic vectors of subsets A, B of U (Figure 2.5). Observe that the Hamming distance d_H(x, y) is just the size of the symmetric difference, |A \ B| + |B \ A|. Observe also that F_0 = |A ∪ B|, so |A \ B| = F_0 − |B| and |B \ A| = F_0 − |A|, and hence d_H(x, y) = 2F_0 − |x| − |y|. Finally, Bob knows |y|, and Alice can send |x| to Bob using log_2 n bits.



Figure 2.5 The Hamming distance between two bit vectors equals the size of the symmetric 
difference of the corresponding subsets of 1-coordinates. 


The point is that a one-way protocol that computes F_0 with communication c yields a one-way protocol that computes d_H(x, y) with communication c + log_2 n. More generally, a (1 ± 1/√n)-approximation of F_0 yields a protocol that estimates d_H(x, y) up to 2F_0/√n ≤ 2√n additive error, with log_2 n extra communication.
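The identity d_H(x, y) = 2F_0 − |x| − |y| can be confirmed by brute force on small inputs. This is a minimal sketch; the function names are ours, and `f0` computes the number of distinct elements directly rather than with a streaming algorithm.

```python
from itertools import product

def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def f0(x, y):
    """Distinct elements in the stream induced by x followed by y, i.e. |A union B|."""
    return sum(1 for a, b in zip(x, y) if a or b)

def hamming_from_f0(x, y):
    """Recover d_H(x, y) from F_0 together with the weights |x| and |y|."""
    return 2 * f0(x, y) - sum(x) - sum(y)

# Exhaustive check over all pairs of 4-bit vectors.
identity_holds = all(hamming(x, y) == hamming_from_f0(x, y)
                     for x in product((0, 1), repeat=4)
                     for y in product((0, 1), repeat=4))
```

In a protocol, Bob would evaluate `hamming_from_f0` using his own |y|, the value |x| received from Alice, and the (possibly approximate) F_0 reported by the streaming algorithm.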

This reduction from Hamming distance estimation to F_0 estimation is only useful to us if the former problem has large communication complexity. It’s technically convenient to convert Hamming distance estimation into a decision problem. We do this using a “promise problem”: intuitively, a problem where we only care about a protocol’s correctness when the input satisfies some conditions (a “promise”). Formally, for a parameter t, we say that a protocol correctly solves Gap-Hamming(t) if it outputs “1” whenever d_H(x, y) < t − c√n and outputs “0” whenever d_H(x, y) > t + c√n, where c is a sufficiently small constant. Note that the protocol can output whatever it wants, without penalty, on inputs for which d_H(x, y) lies in [t − c√n, t + c√n].
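As a sketch of the promise problem, the reference function below returns the required output of Gap-Hamming(t), or `None` when the promise is violated and any output is acceptable. The second helper shows the (assumed) way an estimate of d_H with additive error at most c√n would be turned into a decision: compare it with t.

```python
from math import sqrt

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def gap_hamming(x, y, t, c):
    """Required output of Gap-Hamming(t): 1 if d_H < t - c*sqrt(n),
    0 if d_H > t + c*sqrt(n), None if the promise is violated."""
    gap = c * sqrt(len(x))
    d = hamming(x, y)
    if d < t - gap:
        return 1
    if d > t + gap:
        return 0
    return None

def decide_from_estimate(estimate, t):
    """Decision rule for an estimate whose additive error is at most c*sqrt(n)."""
    return 1 if estimate < t else 0
```

If the true distance is below t − c√n, any estimate within c√n of it is below t, and symmetrically on the other side, so the threshold rule is correct whenever the promise holds.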

Our reduction above shows that, for every t, Gap-Hamming(t) reduces to the (1 ± c'/√n)-approximation of F_0 for a suitable constant c' > 0 (by the additive-error bound above, c' = c/2 suffices). Does it matter how we pick t? Remember we still need to prove that the Gap-Hamming(t) problem does not admit low-communication one-way protocols. If we pick t = 0, then the problem becomes a special case of the Equality problem (where f(x, y) = 1 if and only if x = y). We’ll see next lecture that the one-way randomized communication complexity of Equality is shockingly low: only O(1) for public-coin protocols. Picking t = n has the same issue. Picking t = n/2 seems more promising. For example, it’s easy to certify a “no” instance of Equality: just exhibit an index where x and y differ. How would you succinctly certify that d_H(x, y) is either at least n/2 + √n or at most n/2 − √n? For more intuition, think about two vectors x, y ∈ {0,1}^n chosen uniformly at random. The expected Hamming distance between them is n/2, with a standard deviation on the order of √n. Thus deciding an instance of Gap-Hamming(n/2) has the flavor of learning an unpredictable fact about two random strings, and it seems difficult to do this without learning detailed information about the particular strings at hand.

2.7 Lower Bound on the One-Way Communication Complexity of Gap-Hamming

This section dispenses with the hand-waving and formally proves that every protocol that solves Gap-Hamming, with t = n/2 and c sufficiently small, requires linear communication.


Theorem 2.5 (Jayram et al. 2008; Woodruff 2004, 2007) The randomized one-way communication complexity of Gap-Hamming is Ω(n).


Proof: The proof is a randomized reduction from Index, and is more clever than the other reductions that we’ve seen so far. Consider an input to Index, where Alice holds an n-bit string x and Bob holds an index i ∈ {1, 2, ..., n}. We assume, without loss of generality, that n is odd and sufficiently large.

Alice and Bob generate, without any communication, an input (x', y') to Gap-Hamming. They do this one bit at a time, using the publicly available randomness. To generate the first bit of the Gap-Hamming input, Alice and Bob interpret the first n public coins as a random string r. Bob forms the bit b = r_i, the ith bit of the random string. Intuitively, Bob says “I’m going to pretend that r is actually Alice’s input, and report the corresponding answer r_i.” Meanwhile, Alice checks whether d_H(x, r) < n/2 or d_H(x, r) > n/2. (Since n is odd, one of these holds.) In the former case, Alice forms the bit a = 1 to indicate that r is a decent proxy for her input x. Otherwise, she forms the bit a = 0 to indicate that 1 − r would have been a better approximation of reality (i.e., of x).

The key and clever point of the proof is that a and b are correlated: positively if x_i = 1 and negatively if x_i = 0, where x and i are the given input to Index. To see this, condition on the n − 1 bits of r other than r_i. There are two cases. In the first case, x and r agree on strictly fewer than or strictly more than (n − 1)/2 of the bits so far. In this case, a is already determined (to 0 or 1, respectively). Thus, in this case, Pr[a = b] = Pr[a = r_i] = 1/2, using that r_i is independent of all the other bits. In the second case, amongst the n − 1 bits of r other than r_i, exactly half of them agree with x. In this case, a = 1 if and only if x_i = r_i. Hence, if x_i = 1, then a and b always agree (if r_i = 1 then a = b = 1; if r_i = 0 then a = b = 0). If x_i = 0, then a and b always disagree (if r_i = 1, then a = 0 and b = 1; if r_i = 0, then a = 1 and b = 0).

The probability of the second case is the probability of getting (n − 1)/2 “heads” out of n − 1 coin flips, which is (n−1 choose (n−1)/2) · 2^{-(n-1)}. Stirling’s approximation of the factorial function shows that this probability is bigger than you might have expected, namely ≈ c'/√n for a constant c' (see Exercises for details). We therefore have

\[
\Pr[a = b]
= \underbrace{\Pr[\text{Case 1}]}_{1 - c'/\sqrt{n}} \cdot \underbrace{\Pr[a = b \mid \text{Case 1}]}_{1/2}
+ \underbrace{\Pr[\text{Case 2}]}_{c'/\sqrt{n}} \cdot \underbrace{\Pr[a = b \mid \text{Case 2}]}_{1 \text{ or } 0}
= \begin{cases}
\frac{1}{2} + \frac{c'}{2\sqrt{n}} & \text{if } x_i = 1 \\
\frac{1}{2} - \frac{c'}{2\sqrt{n}} & \text{if } x_i = 0.
\end{cases}
\]

This is pretty amazing when you think about it: Alice and Bob have no knowledge of each other’s inputs and yet, with shared randomness but no explicit communication, can generate bits correlated with x_i.*
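For small odd n the correlation can be computed exactly by enumerating all 2^n public strings r. The sketch below does this (function names are ours); the exact bias is half the probability of Case 2, that is, (n−1 choose (n−1)/2) / 2^n.

```python
from fractions import Fraction
from itertools import product
from math import comb

def hamming(u, v):
    return sum(p != q for p, q in zip(u, v))

def agreement_probability(x, i):
    """Exact Pr_r[a = b] over uniform r in {0,1}^n, for odd n and 1-based index i."""
    n = len(x)
    agree = 0
    for r in product((0, 1), repeat=n):
        a = 1 if 2 * hamming(x, r) < n else 0   # Alice: is d_H(x, r) < n/2?
        b = r[i - 1]                            # Bob: the ith public bit
        agree += (a == b)
    return Fraction(agree, 2 ** n)

def case_two_probability(n):
    """Probability that exactly half of the other n - 1 bits of r agree with x."""
    return Fraction(comb(n - 1, (n - 1) // 2), 2 ** (n - 1))
```

For n = 7 the Case 2 probability is 5/16, so the agreement probability is 21/32 when x_i = 1 and 11/32 when x_i = 0.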

The randomized reduction from Index to Gap-Hamming now proceeds as one would expect. Alice and Bob repeat the bit-generating experiment above m independent times to generate m-bit inputs x' and y' of Gap-Hamming. Here m = qn for a sufficiently large constant q. Each coordinate of x' and y' differs independently with probability 1/2 − c'/(2√n) (if x_i = 1) or 1/2 + c'/(2√n) (if x_i = 0), so for q ≥ 4 the expected Hamming distance between x' and y' is at most m/2 − c'√m (if x_i = 1) or at least m/2 + c'√m (if x_i = 0). A routine application of the Chernoff bound (see Exercises) implies that, for a sufficiently small constant c and large constant q, with probability at least 8/9 (say), d_H(x', y') < m/2 − c√m (if x_i = 1) and d_H(x', y') > m/2 + c√m (if x_i = 0). When this event holds, Alice and Bob can correctly compute the answer to the original input (x, i) to Index by simply invoking any protocol P for Gap-Hamming on the input (x', y'). The communication cost is that of P on inputs of length m = Θ(n). The error is at most the combined error of the randomized reduction and of the protocol P: whenever the reduction and P both proceed as intended, the correct answer to the Index input (x, i) is computed.
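The whole reduction can be simulated. In this sketch a seeded `random.Random` instance plays the role of the public coins, and a simple threshold test on d_H(x', y') stands in for the Gap-Hamming protocol P (both are assumptions of the illustration; a real protocol would not compute the distance exactly).

```python
import random

def correlated_bits(x, i, rng):
    """One round: draw a public string r, return Alice's bit a and Bob's bit b."""
    n = len(x)
    r = [rng.randrange(2) for _ in range(n)]
    d = sum(xj != rj for xj, rj in zip(x, r))
    a = 1 if 2 * d < n else 0      # Alice: is r closer to x than its complement?
    b = r[i - 1]                   # Bob: the ith bit of r (i is 1-based)
    return a, b

def index_via_gap_hamming(x, i, m, rng):
    """Guess x_i from m rounds by thresholding d_H(x', y') at m/2."""
    distance = 0
    for _ in range(m):
        a, b = correlated_bits(x, i, rng)
        distance += (a != b)
    return 1 if 2 * distance < m else 0
```

With n = 11 and m = 4000 the expected distance is roughly 1508 or 2492 depending on x_i, many standard deviations away from the threshold 2000, so the guess is correct except with negligible probability.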

Summarizing, our randomized reduction implies that, if there is a (public-coin) randomized protocol for Gap-Hamming with (two-sided) error 1/3 and sublinear communication, then there is a randomized protocol for Index with error 4/9 and sublinear communication. Since we’ve ruled out the latter, the former does not exist. ■


Combining Theorem 2.5 with our reduction from Gap-Hamming to estimating F_0, we’ve proved the following.


Theorem 2.6 There is a constant c > 0 such that the following statement holds: There is no sublinear-space randomized streaming algorithm that, for every data stream, computes F_0 to within a 1 ± c/√n factor with probability at least 2/3.

A variation on the same reduction proves the same lower bound for approximating F_2; see the Exercises.

*This would clearly not be possible with a private-coin protocol. But we’ll see later that the (additive) difference between the private-coin and public-coin communication complexity of a problem is O(log n), so a linear communication lower bound for one type automatically carries over to the other type.
Our original goal was to prove that the (1 ± ε)-approximate computation of F_0 requires space Ω(ε^{-2}), when ε ≥ 1/√n. Theorem 2.6 proves this in the special case where ε = Θ(1/√n). This can be extended to larger ε by a simple “padding” trick. Fix your favorite values of n and ε ≥ 1/√n and modify the proof of Theorem 2.6 as follows. Reduce from Gap-Hamming on inputs of length m = Θ(ε^{-2}). Given an input (x, y) of Gap-Hamming, form (x', y') by appending n − m zeroes to x and y. A streaming algorithm with space s that estimates F_0 on the induced data stream up to a (1 ± ε) factor induces a randomized protocol that solves this special case of Gap-Hamming with communication s. Theorem 2.5 implies that every randomized protocol for the latter problem uses communication Ω(ε^{-2}), so this lower bound carries over to the space used by the streaming algorithm.









Lecture 3 


Lower Bounds for Compressive Sensing 


3.1 An Appetizer: Randomized Communication Complexity of Equality 


We begin with an appetizer before starting the lecture proper: an example that demonstrates that randomized one-way communication protocols can sometimes exhibit surprising power.

It won’t surprise you that the Equality function, with f(x, y) = 1 if and only if x = y, is a central problem in communication complexity. It’s easy to prove, by the Pigeonhole Principle, that its deterministic one-way communication complexity is n, where n is the length of the inputs x and y.^1 What about its randomized communication complexity? Recall from last lecture that by default, our randomized protocols can use public coins^2 and can have two-sided error ε, where ε is any constant less than 1/2.


Theorem 3.1 (Yao 1979) The (public-coin) randomized one-way communication complexity of Equality is O(1).


Thus, the randomized communication complexity of a problem can be radically smaller than its deterministic communication complexity. A similar statement follows from our upper and lower bound results for estimating the frequency moments F_0 and F_2 using small-space streaming algorithms, but Theorem 3.1 illustrates this point in a starker and clearer way.

Theorem 3.1 provides a cautionary tale: sometimes we expect a problem to be hard for a class of protocols and are proved wrong by a clever protocol; other times, clever protocols don’t provide non-trivial solutions to a problem but still make proving strong lower bounds technically difficult. Theorem 3.1 also suggests that, if we want to prove strong communication lower bounds for randomized protocols via a reduction, there might not be too many natural problems out there to reduce from.^3


^1 We’ll see later that this lower bound applies to general deterministic protocols, not just to one-way protocols.

^2 Recall the public-coin model: when Alice and Bob show up there is already an infinite stream of random bits written on a blackboard, which both of them can see. Using shared randomness does not count toward the communication cost of the protocol.

^3 Recall our discussion about Gap-Hamming last lecture: for the problem to be hard, it is important to choose the midpoint t to be n/2. With t too close to 0 or n, the problem is a special case of Equality and is therefore easy for randomized protocols.
Proof of Theorem 3.1: The protocol is as follows.

1. Alice and Bob interpret the first 2n public coins as random strings r_1, r_2 ∈ {0,1}^n. This requires no communication.

2. Alice sends the two random inner products ⟨x, r_1⟩ mod 2 and ⟨x, r_2⟩ mod 2 to Bob. This requires two bits of communication.

3. Bob reports “1” if and only if his random inner products match those of Alice: ⟨y, r_j⟩ = ⟨x, r_j⟩ mod 2 for j = 1, 2. Note that Bob has all of the information needed to perform this computation.


We claim that the error of this protocol is at most 25% on every input. The protocol’s error is one-sided: when x = y the protocol always accepts, so there are no false negatives. Suppose that x ≠ y. We use the Principle of Deferred Decisions to argue that, for each j = 1, 2, the inner products ⟨y, r_j⟩ and ⟨x, r_j⟩ are different (mod 2) with probability exactly 50%. To see this, pick an index i where x_i ≠ y_i and condition on all of the bits of a random string except for the ith one. Let a and b denote the values of the inner products-so-far of x and y (modulo 2) with the random string. If the ith random bit is a 0, then the final inner products are also a and b. If the ith random bit is a 1, then one inner product stays the same while the other flips its value (since exactly one of x_i, y_i is a 1). Thus, whether a = b or a ≠ b, exactly one of the two random bit values (50% probability) results in the final two inner products having different values (modulo 2). The probability that two unequal strings have equal inner products (modulo 2) in two independent experiments is 25%. ■
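The protocol and its error analysis can be checked exhaustively for short strings. The following sketch (function names are ours) implements Bob’s decision for given public strings and computes the exact error probability by enumerating every pair (r_1, r_2).

```python
from fractions import Fraction
from itertools import product

def inner_mod2(u, v):
    """Inner product of two 0-1 vectors, modulo 2."""
    return sum(a & b for a, b in zip(u, v)) % 2

def equality_protocol(x, y, r1, r2):
    """Bob's output given the public strings r1, r2 and Alice's 2-bit message."""
    message = (inner_mod2(x, r1), inner_mod2(x, r2))            # sent by Alice
    return int(message == (inner_mod2(y, r1), inner_mod2(y, r2)))

def exact_error(x, y):
    """Exact error probability over all choices of the public strings."""
    n = len(x)
    truth = int(x == y)
    strings = list(product((0, 1), repeat=n))
    wrong = sum(equality_protocol(x, y, r1, r2) != truth
                for r1 in strings for r2 in strings)
    return Fraction(wrong, 4 ** n)
```

For any two distinct 4-bit strings the enumeration gives error exactly 1/4, and for equal strings it gives 0, matching the one-sided error claimed in the proof.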


The proof of Theorem 3.1 gives a 2-bit protocol with (1-sided) error 25%. As usual, executing many parallel copies of the protocol reduces the error to an arbitrarily small constant, with a constant blow-up in the communication complexity.

The protocol used to prove Theorem 3.1 makes crucial use of public coins. We’ll see later that the private-coin one-way randomized communication complexity of Equality is Θ(log n), which is worse than public-coin protocols but still radically better than deterministic protocols. More generally, next lecture we’ll prove Newman’s theorem, which states that the private-coin randomized communication complexity of a problem is at most O(log n) more than its public-coin randomized communication complexity.

The protocol in the proof of Theorem 3.1 effectively gives each of the two strings x, y a 2-bit “sketch” or “fingerprint” such that the property of distinctness is approximately preserved. Clearly, this is the same basic idea as hashing. This is a useful idea in both theory and practice, and we’ll use it again shortly.


Remark 3.2 The computational model studied in communication complexity is potentially 
very powerful — for example, Alice and Bob have unlimited computational power — and 
the primary point of the model is to prove lower bounds. Thus, whenever you see an upper 
bound result in communication complexity, like Theorem 3.1, it’s worth asking what the 
point of the result is. In many cases, a positive result is really more of a “negative negative result,” intended to prove the tightness of a lower bound rather than offer a practical solution to a real problem. In other cases, the main point is to demonstrate separations between different notions of communication complexity or between different problems. For example, Theorem 3.1 shows that the one-way deterministic and randomized communication complexity of a problem can be radically different, even if we insist on one-sided error. It also shows that the randomized communication complexity of Equality is very different from that of the problems we studied last lecture: Disjointness, Index, and Gap-Hamming.

Theorem 3.1 also uses a quite reasonable protocol, which is not far from a practical solution to probabilistic equality-testing. In some applications, the public coins can be replaced by a hash function that is published in advance; in other applications, one party can choose a random hash function that can be specified with a reasonable number of bits and communicate it to other parties.


3.2 Sparse Recovery 


3.2.1 The Basic Setup 

The field of sparse recovery has been a ridiculously hot area for the past ten years, in applied mathematics, machine learning, and theoretical computer science. We’ll study sparse recovery in the standard setup of “compressive sensing” (also called “compressed sensing”). There is an unknown “signal” (i.e., a real-valued vector x ∈ R^n) that we want to learn. The bad news is that we’re only allowed to access the signal through “linear measurements;” the good news is that we have the freedom to choose whatever measurements we want. Mathematically, we want to design a matrix A ∈ R^{m×n}, with m as small as possible, such that we can recover the unknown signal x from the linear measurements Ax (whatever x may be).

As currently stated, this is a boring problem. It is clear that n measurements are sufficient: just take A = I, or any other invertible n × n matrix. It is also clear that n measurements are necessary: if m < n, then there is an entire affine subspace of dimension at least n − m of vectors that have image Ax under A, and we have no way to know which one of them is the actual unknown signal.

The problem becomes interesting when we also assume that the unknown signal x is “sparse,” in senses we define shortly. The hope is that under the additional promise that x is sparse, we can get away with far fewer than n measurements (Figure 3.1).


3.2.2 A Toy Version 

To develop intuition for this problem, let’s explore a toy version. Suppose you are promised that x is a 0-1 vector with exactly k 1’s (and hence n − k 0’s). Throughout this lecture, k is the parameter that measures the sparsity of the unknown signal x. It could be anything, but you might want to keep k ≈ √n in mind as a canonical parameter value. Let X denote the set of all such k-sparse 0-1 vectors, and note that |X| = (n choose k).


Figure 3.1 The basic compressive sensing setup. The goal is to design a matrix A such that an unknown sparse signal x can be recovered from the linear measurements Ax.

Here’s a solution to the sparse recovery problem under the guarantee that x ∈ X. Take m = 3 log_2 |X|, and choose each of the m rows of the sensing matrix A independently and uniformly at random from {0,1}^n. By the fingerprinting argument used in our randomized protocol for Equality (Theorem 3.1), for fixed distinct x, x' ∈ X, we have

\[ \Pr_A[Ax = Ax' \bmod 2] = \frac{1}{2^m} = \frac{1}{|X|^3}. \]

Of course, the probability that Ax = Ax' (not modulo 2) is only less. Taking a Union Bound over the at most |X|^2 distinct pairs x, x' ∈ X, we have

\[ \Pr_A[\text{there exists } x \ne x' \text{ s.t. } Ax = Ax'] \le \frac{1}{|X|}. \]

Thus, there is a matrix A that maps all x ∈ X to distinct m-vectors. (Indeed, a random A works with high probability.) Thus, given Ax, one can recover x: if nothing else, one can just compute Ax' for every x' ∈ X until a match is found.^4
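The toy scheme can be run directly for small n and k. The sketch below departs slightly from the text, as an assumption for the sake of a deterministic demo: instead of drawing a single random matrix, `sample_sensing_matrix` redraws until the matrix is injective on X, which by the union bound above takes very few attempts. Recovery is the brute-force search described in the text.

```python
import random
from itertools import combinations
from math import ceil, comb, log2

def sparse_vectors(n, k):
    """All 0-1 vectors of length n with exactly k ones."""
    for support in combinations(range(n), k):
        yield tuple(1 if j in support else 0 for j in range(n))

def measure(A, x):
    """The linear measurements Ax (over the integers, not modulo 2)."""
    return tuple(sum(a * b for a, b in zip(row, x)) for row in A)

def sample_sensing_matrix(n, k, rng):
    """Random 0-1 matrix with m = ceil(3 log2 |X|) rows, redrawn until injective on X."""
    m = ceil(3 * log2(comb(n, k)))
    X = list(sparse_vectors(n, k))
    while True:
        A = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
        if len({measure(A, x) for x in X}) == len(X):
            return A

def recover(A, b, n, k):
    """Brute-force recovery: the unique k-sparse 0-1 vector x with Ax = b, if any."""
    for x in sparse_vectors(n, k):
        if measure(A, x) == b:
            return x
    return None
```

For n = 8 and k = 2 there are 28 candidate signals and the matrix has 15 rows; the savings over n measurements only appear for larger n, where the brute-force search becomes the bottleneck.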

^4 We won’t focus on computational efficiency in this lecture, but positive results in compressive sensing generally also have computationally efficient recovery algorithms. Our lower bounds will hold even for recovery algorithms with unbounded computational power.

The point is that m = Θ(log |X|) measurements are sufficient to recover exactly k-sparse 0-1 vectors. Recalling that |X| = (n choose k) and that

\[ \left( \frac{n}{k} \right)^{k} \underbrace{\le}_{\text{easy}} \binom{n}{k} \underbrace{\le}_{\text{Stirling's approx.}} \left( \frac{en}{k} \right)^{k}, \]

we see that m = Θ(k log(n/k)) rows suffice.^5

This exercise shows that, at least for the special case of exactly sparse 0-1 signals, we can indeed achieve recovery with far fewer than n measurements. For example, if k ≈ √n, we need only O(√n log n) measurements.
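A quick numeric check of the two binomial bounds and of the resulting measurement count, for the canonical choice k ≈ √n. The specific values n = 10000 and k = 100 are just an example.

```python
from math import comb, e, log2

def binomial_bounds_hold(n, k):
    """Check (n/k)^k <= C(n, k) <= (en/k)^k."""
    return (n / k) ** k <= comb(n, k) <= (e * n / k) ** k

def measurements_needed(n, k):
    """The toy scheme's row count m = 3 log2 |X| with |X| = C(n, k)."""
    return 3 * log2(comb(n, k))

n, k = 10_000, 100
m = measurements_needed(n, k)      # a few thousand rows, well below n
lower = k * log2(n / k)            # k log2(n/k)
upper = k * log2(e * n / k)        # k log2(en/k)
```

Here log_2 |X| is sandwiched between k log_2(n/k) and k log_2(en/k), so m is within a constant factor of k log(n/k), and it is noticeably smaller than n.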

Now that we have a proof of concept that recovering an unknown sparse vector is an 
interesting problem, we’d like to do better in two senses. First, we want to move beyond the 
toy version of the problem to the “real” version of the problem. Second, we have to wonder 
whether even fewer measurements suffice. 


3.2.3 Motivating Applications 

To motivate the real version of the problem, we mention a couple of canonical applications of 
compressive sensing. One buzzword you can look up and read more about is the “single-pixel 
camera.” The standard approach to taking pictures is to first take a high-resolution picture 
in the “standard basis” — e.g., a light intensity for each pixel — and then to compress the 
picture later (via software). Because real-world images are typically sparse in a suitable 
basis, they can be compressed a lot. The compressive sensing approach asks, then why 
not just capture the image directly in a compressed form — in a representation where its 
sparsity shines through? For example, one can store random linear combinations of light 
intensities (implemented via suitable mirrors) rather than the light intensities themselves. 
This idea leads to a reduction in the number of pixels needed to capture an image at a 
given resolution. Another application of compressive sensing is in MRI. Here, decreasing 
the number of measurements decreases the time necessary for a scan. Since a patient needs 
to stay motionless during a scan — in some cases, not even breathing — shorter scan times 
can be a pretty big deal. 


3.2.4 The Real Problem 

If the unknown signal x is an image, say, there’s no way it’s an exactly sparse 0-1 vector. We 
need to consider more general unknown signals x that are real-valued and only “approximately 
sparse.” To measure approximate sparsity, with respect to a choice of $k$, we define the
residual $\mathrm{res}(x)$ of a vector $x \in \mathbb{R}^n$ as the contribution to $x$’s $\ell_1$ norm by its $n - k$ coordinates
with smallest magnitudes. Recall that the $\ell_1$ norm is just $\|x\|_1 = \sum_{i=1}^n |x_i|$. If we imagine
sorting the coordinates by $|x_i|$, then the residual of $x$ is just the sum of the $|x_i|$’s of the final
$n - k$ terms. If $x$ is exactly $k$-sparse, it has at least $n - k$ zeros and hence $\mathrm{res}(x) = 0$.

Footnote (on the $k \log \frac{n}{k}$ bound for the toy problem): If you read through the compressive sensing literature, you’ll be plagued by ubiquitous “$k \log \frac{n}{k}$” terms;
remember that this is just $\approx \log \binom{n}{k}$.

The goal is to design a sensing matrix A with a small number m of rows such that an 
unknown approximately sparse vector x can be recovered from Ax. Or rather, given that x 
is only approximately sparse, we want to recover a close approximation of x. 

The formal guarantee we’ll seek for the matrix $A$ is the following: for every $x \in \mathbb{R}^n$, we
can compute from $Ax$ a vector $x'$ such that

$$\|x' - x\|_1 \le c \cdot \mathrm{res}(x). \qquad (3.1)$$
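
As a small illustration of the definitions, here is a sketch that computes the residual of a vector and checks whether a candidate output meets the guarantee (3.1). The function names and the sample vectors are hypothetical; nothing here says how to find a good $x'$, only how to test one.

```python
def residual(x, k):
    """Sum of the magnitudes of the n - k smallest-magnitude coordinates of x."""
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(magnitudes[k:])


def satisfies_guarantee(x, x_recovered, k, c=2.0):
    """Check the l1 guarantee ||x' - x||_1 <= c * res(x) from (3.1)."""
    error = sum(abs(a - b) for a, b in zip(x_recovered, x))
    return error <= c * residual(x, k)


if __name__ == "__main__":
    x = [5, 0, -3, 0.5, 0, 0.25]      # approximately 2-sparse
    print(residual(x, 2))             # mass outside the two largest entries
    print(satisfies_guarantee(x, [5, 0, -3, 0, 0, 0], k=2))
```

Note that when `x` is exactly k-sparse the residual is zero, so only `x` itself passes the check.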


Here $c$ is a constant, like 2. The guarantee (3.1) is very compelling. First, if $x$ is exactly
$k$-sparse, then $\mathrm{res}(x) = 0$ and so (3.1) demands exact recovery of $x$. The guarantee is
parameterized by how close $x$ is to being sparse: the recovered vector $x'$ should lie in a
ball (in the $\ell_1$ norm) around $x$, and the further $x$ is from being $k$-sparse, the bigger this
ball is. Intuitively, the radius of this ball has to depend on something like $\mathrm{res}(x)$. For
example, suppose that $x'$ is exactly $k$-sparse (with $\mathrm{res}(x') = 0$). The guarantee (3.1) forces
the algorithm to return $x'$ for every unknown signal $x$ with $Ax = Ax'$. Recall that when
$m < n$, such signals $x$ form an affine subspace of dimension at least $n - m$. In the extreme case
where there is such an $x$ for which zeroing out the $n - k$ smallest-magnitude coordinates yields $x'$
(i.e., where $x$ is $x'$ with a little noise added to its zero coordinates), the recovery algorithm is
forced to return a solution $x'$ with $\|x' - x\|_1 = \mathrm{res}(x)$.

Now that we’ve defined the real version of the problem, is there an interesting solution? 
Happily, the real version can be solved as well as the toy version. 


Fact 3.3 (Candes et al. 2006; Donoho 2006) With high probability, a random $m \times n$
matrix $A$ with $\Theta(k \log \frac{n}{k})$ rows admits a recovery algorithm that satisfies the guarantee
in (3.1) for every $x \in \mathbb{R}^n$.


Fact 3.3 is non-trivial and well outside the scope of this course. The fact is true for 
several different distributions over matrices, including the case where each matrix entry 
is an independent standard Gaussian. Also, there are computationally efficient recovery 
algorithms that achieve the guarantee.


3.3 A Lower Bound for Sparse Recovery 
3.3.1 Context 


At the end of Section 3.2.2 when we asked “can we do better?,” we meant this in two senses.
First, can we extend the positive results for the toy problem to a more general problem?
Fact 3.3 provides a resounding affirmative answer. Second, can we get away with an even
smaller value of $m$, that is, even fewer measurements? In this section we prove a relatively recent
result of Do Ba et al. (2010), who showed that the answer is “no.” Amazingly, they proved
this fundamental result via a reduction to a lower bound in communication complexity (for
INDEX).

Footnote (on the efficient recovery algorithms mentioned after Fact 3.3): See e.g. Moitra (2014) for an introduction to such positive results about sparse recovery. A lecture of
the instructor’s CS264 course also gives a brief overview.


Theorem 3.4 (Do Ba et al. 2010) If an $m \times n$ matrix $A$ admits a recovery algorithm
$R$ that, for every $x \in \mathbb{R}^n$, computes from $Ax$ a vector $x'$ that satisfies (3.1), then
$m = \Omega(k \log \frac{n}{k})$.

The lower bound is information-theoretic, and therefore applies also to recovery algorithms 
with unlimited computational power. 

Note that there is an easy lower bound of $m \ge k$. Theorem 3.4 offers a non-trivial
improvement when $k = o(n)$; in this case, we’re asking whether or not it’s possible to shave
off the log factor in Fact 3.3. Given how fundamental the problem is, and how much a
non-trivial improvement could potentially matter in practice, this question is well worth
asking.


3.3.2 Proof of Theorem 3.4: First Attempt


Recall that we first proved our upper bound of $m = O(k \log \frac{n}{k})$ in the toy setting of
Section 3.2.2, and then stated in Fact 3.3 that it can be extended to the general version of
the problem. Let’s first try to prove a matching lower bound on $m$ that applies even in the
toy setting. Recall that $X$ denotes the set of all 0-1 vectors that have exactly $k$ 1’s, and
that $|X| = \binom{n}{k}$.

For vectors $x \in X$, the guarantee (3.1) demands exact recovery. Thus, the sensing
matrix $A$ that we pick has to satisfy $Ax \ne Ax'$ for all distinct $x, x' \in X$. That is, $Ax$
encodes $x$ for all $x \in X$. But $X$ has $\binom{n}{k}$ members, so the worst-case encoding length of $Ax$
has to be at least $\log_2 \binom{n}{k} = \Theta(k \log \frac{n}{k})$. So are we done?

The issue is that we want a lower bound on the number of rows $m$ of $A$, not on the
worst-case length of $Ax$ in bits. What is the relationship between these two quantities?
Note that, even if $A$ is a 0-1 matrix, each entry of $Ax$ is generally of magnitude $\Theta(k)$,
requiring $\Omega(\log k)$ bits to write down. For example, when $k$ is polynomial in $n$ (like our
running choice $k = \sqrt{n}$), then $Ax$ generally requires $\Omega(m \log n)$ bits to describe, even when
$A$ is a 0-1 matrix. Thus our lower bound of $\Theta(k \log \frac{n}{k})$ on the length of $Ax$ does not yield a
lower bound on $m$ better than the trivial lower bound of $k$.
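
The arithmetic behind this disappointment can be sketched as follows, under the assumption that $A$ is a 0-1 matrix. Each entry of $Ax$ is then an integer in $\{0, \ldots, k\}$ for $x \in X$, so $Ax$ takes at most $(k+1)^m$ values, and injectivity on $X$ only forces $(k+1)^m \ge \binom{n}{k}$. The function name and the parameter values below are illustrative.

```python
from math import ceil, comb, log2


def rows_forced_by_counting(n, k):
    """Smallest m with (k + 1)^m >= C(n, k): all that pure counting can
    force when A is 0-1 and x ranges over 0-1 vectors with exactly k ones."""
    return ceil(log2(comb(n, k)) / log2(k + 1))


if __name__ == "__main__":
    n, k = 2**20, 2**10  # k = sqrt(n)
    print("bits needed to encode x:", ceil(log2(comb(n, k))))
    print("rows forced by counting:", rows_forced_by_counting(n, k))
    print("trivial lower bound k:  ", k)
```

For these parameters the number of rows forced by counting is within a small constant factor of $k$, so the counting argument does not beat the trivial bound.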


Footnote (on the result of Do Ba et al.): Such a lower bound was previously known for various special cases: for particular classes of matrices,
for particular families of recovery algorithms, etc.

Footnote (on the easy lower bound $m \ge k$): For example, consider just the subset of vectors $x$ that are zero in the final $n - k$ coordinates. The
guarantee (3.1) demands exact recovery of all such vectors. Thus, we’re back to the boring version of the
problem mentioned at the beginning of the lecture, and $A$ has to have rank at least $k$.

Footnote (on the failed first attempt): This argument does imply that $m = \Omega(k \log \frac{n}{k})$ if we only report $Ax$ modulo 2 (or some other
constant), since in this case the length of $Ax$ is $O(m)$.

The argument above was doomed to fail. The reason is that, if you only care about
recovering exactly $k$-sparse vectors $x$, rather than the more general and robust guarantee
in (3.1), then $m = 2k$ suffices! One proof is via “Prony’s Method,” which uses the fact
that a $k$-sparse vector $x$ can be recovered exactly from its first $2k$ Fourier coefficients
(see e.g. Moitra (2014)). Our argument above only invoked the requirement (3.1) for $k$-sparse
vectors $x$, and such an argument cannot prove a lower bound of the form $m = \Omega(k \log \frac{n}{k})$.

The first take-away from this exercise is that, to prove the lower bound that we want,
we need to use the fact that the matrix $A$ satisfies the guarantee (3.1) also for non-$k$-sparse
vectors $x$. The second take-away is that we need a smarter argument: a straightforward
application of the Pigeonhole Principle is not going to cut it.


3.3.3 Proof of Theorem 3.4

A Communication Complexity Perspective


We can interpret the failed proof attempt in Section 3.3.2 as an attempted reduction from a
“promise version” of INDEX. Recall that in this communication problem, Alice has an input
$x \in \{0,1\}^n$, Bob has an index $i \in \{1, 2, \ldots, n\}$, specified using $\approx \log_2 n$ bits, and the goal
is to compute $x_i$ using a one-way protocol. We showed last lecture that the deterministic
communication complexity of this problem is $n$ (via an easy counting argument) and its
randomized communication complexity is $\Omega(n)$ (via a harder counting argument).

The previous proof attempt can be rephrased as follows. Consider a matrix $A$ that
permits exact recovery for all $x \in X$. This induces a one-way protocol for solving INDEX
whenever Alice’s input $x$ lies in $X$: Alice simply sends $Ax$ to Bob, Bob recovers $x$, and
Bob can then solve the problem, whatever his index $i$ might be. The communication cost of
this protocol is exactly the length of $Ax$, in bits. The deterministic one-way communication
complexity of this promise version of INDEX is $\Theta(k \log \frac{n}{k})$, by the usual counting argument, so
this lower bound applies to the length of $Ax$.


The Plan 

How can we push this idea further? We begin by assuming the existence of an $m \times n$
matrix $A$, a recovery algorithm $R$, and a constant $c \ge 1$ such that, for every $x \in \mathbb{R}^n$, $R$
computes from $Ax$ a vector $x' \in \mathbb{R}^n$ that satisfies

$$\|x' - x\|_1 \le c \cdot \mathrm{res}(x). \qquad (3.2)$$

Our goal is to show that if $m \ll k \log \frac{n}{k}$, then we can solve INDEX with sublinear
communication.

Footnote (on Prony’s Method): This method uses a sensing matrix $A$ for which $Ax$ will generally have $\Omega(m \log n) = \Omega(k \log n)$ bits, so
this does not contradict our lower bound on the necessary length of $Ax$.

For simplicity, we assume that the recovery algorithm $R$ is deterministic. The lower
bound continues to hold for randomized recovery algorithms that have success probability
at least 2/3, but the proof requires more work; see Section 3.3.4.


An Aside on Bit Complexity

We can assume that the sensing matrix $A$ has reasonable bit complexity. Roughly: (i) we
can assume without loss of generality that $A$ has orthonormal rows (by a change of basis
argument); (ii) dropping all but the $O(\log n)$ highest-order bits of every entry has negligible
effect on the recovery algorithm $R$. We leave the details to the Exercises.

When every entry of the matrix $A$ and of a vector $x$ can be described in $O(\log n)$ bits
(equivalently, by scaling, is a polynomially-bounded integer), the same is true of $Ax$. In
this case, $Ax$ has length $O(m \log n)$.


Redefining X

It will be useful later to redefine the set $X$. Previously, $X$ was all 0-1 $n$-vectors with exactly
$k$ 1’s. Now it will be a subset of such vectors, subject to the constraint that

$$d_H(x, x') \ge 0.1k$$

for every distinct pair $x, x'$ of vectors in the set. Recall that $d_H(\cdot, \cdot)$ denotes the Hamming
distance between two vectors, i.e., the number of coordinates in which they differ. Two distinct
0-1 vectors with $k$ 1’s each have Hamming distance between 2 and $2k$, so we’re restricting
to a subset of such vectors whose supports differ substantially. The following lemma says
that there exist sets of such vectors with size not too much smaller than the number of all
0-1 vectors with $k$ 1’s; we leave the proof to the Exercises.


Lemma 3.5 Suppose $k \le 0.01n$. There is a set $X$ of 0-1 $n$-bit vectors such that each $x \in X$
has $k$ 1’s, each distinct $x, x' \in X$ satisfy $d_H(x, x') \ge 0.1k$, and $\log_2 |X| = \Omega(k \log \frac{n}{k})$.


Lemma 3.5 is reminiscent of the fact that there are large error-correcting codes with large 
distance. One way to prove Lemma 3.5 is via the probabilistic method — by showing that a 
suitable randomized experiment yields a set with the desired size with positive probability. 
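
To make the flavor of such a randomized experiment concrete, here is a minimal sketch that samples random $k$-subsets and greedily keeps those that are far from everything kept so far. It is an illustration only, not the proof left to the Exercises, and it makes no claim about how large the resulting family is. The function name, parameters, and seed are assumptions of this example.

```python
import random


def separated_family(n, k, attempts, min_fraction=0.1, seed=0):
    """Greedily collect k-subsets of {0, ..., n-1} (the supports of 0-1
    vectors with k ones) with pairwise Hamming distance >= min_fraction * k."""
    rng = random.Random(seed)
    family = []
    for _ in range(attempts):
        candidate = frozenset(rng.sample(range(n), k))
        # The Hamming distance between two 0-1 vectors equals the size of
        # the symmetric difference of their supports.
        if all(len(candidate ^ other) >= min_fraction * k for other in family):
            family.append(candidate)
    return family


if __name__ == "__main__":
    family = separated_family(n=4000, k=40, attempts=200)
    print(len(family), "vectors kept out of 200 samples")
```

When $k \le 0.01n$, two random supports rarely overlap much, which is why almost every sample survives the distance check.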

Intuitively, our proof attempt in Section 3.3.2 used that, because each $x \in X$ is exactly
$k$-sparse, it can be recovered exactly from $Ax$ and thus there is no way to get confused
between distinct elements of $X$ from their images. In the following proof, we’ll instead
need to recover “noisy” versions of the $x$’s; recall that our proof cannot rely only on the
fact that the matrix $A$ performs exact recovery of exactly sparse vectors. This means we
might get confused between two different 0-1 vectors $x, x'$ that have small (but non-zero)
Hamming distance. The above redefinition of $X$, which is essentially costless by Lemma 3.5,
fixes the issue by restricting to vectors that all look very different from one another.

The Reduction

To obtain a lower bound of $m = \Omega(k \log \frac{n}{k})$ rather than $m = \Omega(k)$, we need a more
sophisticated reduction than in Section 3.3.2. The parameters offer some strong clues as to
what the reduction should look like. If we use a protocol where Alice sends $Ay$ to Bob, where
$A$ is the assumed sensing matrix with a small number $m$ of rows and $y$ is some vector of Alice’s
choosing (perhaps a noisy version of some $x \in X$), then the communication cost will be
$O(m \log n)$. We want to prove that $m = \Omega(k \log \frac{n}{k}) = \Omega(\log |X|)$ (using Lemma 3.5). This
means we need a communication lower bound of $\Omega(\log |X| \log n)$. Since the communication
lower bound for INDEX is linear, this suggests considering inputs to INDEX where Alice’s
input has length $\log |X| \log n$, and Bob is given an index $i \in \{1, 2, \ldots, \log |X| \log n\}$.

To describe the reduction formally, let $\mathrm{enc} : X \to \{0,1\}^{\log_2 |X|}$ be a binary encoding of
the vectors of $X$. (We can assume that $|X|$ is a power of 2.) Here is the communication
protocol for solving INDEX.

(1) Alice interprets her $(\log |X| \log n)$-bit input as $\log n$ blocks of $\log |X|$ bits each. For
each $j = 1, 2, \ldots, \log n$, she interprets the bits of the $j$th block as $\mathrm{enc}(x_j)$ for some
$x_j \in X$. See Figure 3.2.

(2) Alice computes a suitable linear combination of the $x_j$’s:

$$y = \sum_{j=1}^{\log n} \alpha^j x_j,$$

with the details provided below. Each entry of $y$ will be a polynomially-bounded
integer.

(3) Alice sends $Ay$ to Bob. This uses $O(m \log n)$ bits of communication.

(4) Bob uses $Ay$ and the assumed recovery algorithm $R$ to recover all of $x_1, \ldots, x_{\log n}$.
(Details TBA.)

(5) Bob identifies the block $j$ of $\log |X|$ bits that contains his given index $i \in \{1, 2, \ldots, \log |X| \log n\}$
and outputs the relevant bit of $\mathrm{enc}(x_j)$.

If we can implement steps (2) and (4), then we’re done: this five-step deterministic protocol
would solve INDEX on $(\log |X| \log n)$-bit inputs using $O(m \log n)$ communication. Since
the communication complexity of INDEX is linear, we have $m \log n = \Omega(\log |X| \log n)$ and
conclude that $m = \Omega(\log |X|) = \Omega(k \log \frac{n}{k})$.

Footnote (on the reduction of Section 3.3.2): We assume from now on that $k = o(n)$, since otherwise there is nothing to prove.

Footnote (on the conclusion above): Thus even the deterministic communication lower bound for INDEX, which is near-trivial to prove
(see the earlier lectures), has very interesting implications. See Section 3.3.4 for the stronger implications provided by
the randomized communication complexity lower bound.

[Figure 3.2: Alice’s input drawn as consecutive blocks $\mathrm{enc}(x_1), \mathrm{enc}(x_2), \ldots, \mathrm{enc}(x_{\log n})$ of $\log_2 |X|$ bits each, for $\log_2 |X| \log_2 n$ bits in total.]

Figure 3.2 In the reduction, Alice interprets her $(\log |X| \log n)$-bit input as $\log n$ blocks, with each
block encoding some vector $x_j \in X$.


For step (2), let $\alpha \ge 2$ be a sufficiently large constant, depending on the constant $c$ in
the recovery guarantee (3.2) that the matrix $A$ and algorithm $R$ satisfy. Then

$$y = \sum_{j=1}^{\log n} \alpha^j x_j. \qquad (3.3)$$

Recall that each $x_j$ is a 0-1 $n$-vector with exactly $k$ 1’s, so $y$ is just a superposition of
$\log n$ scaled copies of such vectors. For example, the $\ell_1$ norm of $y$ is simply $k \sum_{j=1}^{\log n} \alpha^j$. In
particular, since $\alpha$ is a constant, the entries of $y$ are polynomially bounded non-negative
integers, as promised earlier, and $Ay$ can be described using $O(m \log n)$ bits.
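
Alice’s step (2) is simple enough to write out. The sketch below forms $y$ as in (3.3) on a toy instance and checks the $\ell_1$ norm identity just mentioned. The function name, the tiny vectors, and the value of alpha are made up for illustration; in the proof alpha is a large constant depending on $c$.

```python
def alice_encode(blocks, alpha):
    """Return y = sum_j alpha^j * x_j for 0-1 vectors x_1, ..., x_L (j is 1-indexed)."""
    n = len(blocks[0])
    y = [0] * n
    for j, x in enumerate(blocks, start=1):
        scale = alpha ** j
        for t in range(n):
            y[t] += scale * x[t]
    return y


if __name__ == "__main__":
    # Toy instance with n = 8, k = 2 and three blocks (all hypothetical).
    blocks = [
        [1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 1],
    ]
    alpha, k = 4, 2
    y = alice_encode(blocks, alpha)
    geometric_sum = sum(alpha ** j for j in range(1, len(blocks) + 1))
    print("y =", y)
    print("l1 norm matches k * sum_j alpha^j:", sum(y) == k * geometric_sum)
    print("bits per entry of y:", max(y).bit_length())
```

Since every entry of $y$ is at most $\sum_j \alpha^j \le 2\alpha^{\log n}$, which is polynomial in $n$ for constant $\alpha$, each entry fits in $O(\log n)$ bits.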


Bob’s Recovery Algorithm 


The interesting part of the protocol and analysis is step (4), where Bob wants to recover the
vectors $x_1, \ldots, x_{\log n}$ encoded by Alice’s input using only knowledge of $Ay$ and black-box
access to the recovery algorithm $R$. To get started, we explain how Bob can recover the last
vector $x_{\log n}$, which suffices to solve INDEX in the lucky case where Bob’s index $i$ is one of the
last $\log_2 |X|$ positions. Intuitively, this is the easiest case, since $x_{\log n}$ is by far (for large $\alpha$)
the largest contributor to the vector $y$ Alice computes in (3.3). With $y = \alpha^{\log n} x_{\log n} + \text{noise}$,
we might hope that the recovery algorithm can extract $\alpha^{\log n} x_{\log n}$ from $Ay$.

Bob’s first step is the only thing he can do: invoke the recovery algorithm $R$ on the
message $Ay$ from Alice. By assumption, $R$ returns a vector $\hat{y}$ satisfying

$$\|\hat{y} - y\|_1 \le c \cdot \mathrm{res}(y),$$

where $\mathrm{res}(y)$ is the contribution to $y$’s $\ell_1$ norm by its smallest $n - k$ coordinates.

Bob then computes $\hat{y}$’s nearest neighbor $x^*$ in a scaled version of $X$ under the $\ell_1$ norm:
$x^* = \mathrm{argmin}_{x \in X} \|\hat{y} - \alpha^{\log n} x\|_1$. Bob can do this computation by brute force.

The key claim is that the computed vector $x^*$ is indeed $x_{\log n}$. This follows from a
geometric argument, pictured in Figure 3.3. Briefly: (i) because $\alpha$ is large, $y$ and $\alpha^{\log n} x_{\log n}$
are close to each other; (ii) since $y$ is approximately $k$-sparse (by (i)) and since $R$ satisfies
the approximate recovery guarantee in (3.2), $\hat{y}$ and $y$ are close to each other and hence
$\alpha^{\log n} x_{\log n}$ and $\hat{y}$ are close; and (iii) since distinct vectors in $X$ have large Hamming distance,
for every $x \in X$ other than $x_{\log n}$, $\alpha^{\log n} x$ is far from $\alpha^{\log n} x_{\log n}$ and hence also far from $\hat{y}$. We
conclude that $\alpha^{\log n} x_{\log n}$ is closer to $\hat{y}$ than any other scaled vector from $X$.

Figure 3.3 The triangle inequality implies that the vector $x^*$ computed by Bob from $\hat{y}$ must be
$x_{\log n}$.

We now supply the details.

(i) Recall that $y$ is the superposition (i.e., sum) of scaled versions of the vectors
$x_1, \ldots, x_{\log n}$, so $y - \alpha^{\log n} x_{\log n}$ is just $y$ with the last contributor omitted. Thus,

$$\|y - \alpha^{\log n} x_{\log n}\|_1 = \sum_{j=1}^{\log n - 1} \alpha^j k.$$

Assuming that $\alpha \ge \max\{2, 200c\}$, where $c$ is the constant that $A$ and $R$ satisfy in (3.2),
we can bound from above the geometric sum and derive

$$\|y - \alpha^{\log n} x_{\log n}\|_1 \le 2k\alpha^{\log n - 1} \le \frac{0.01}{c} \cdot k\alpha^{\log n}. \qquad (3.4)$$

(ii) By considering the $n - k$ coordinates of $y$ other than the $k$ that are non-zero in $x_{\log n}$,
we can upper bound the residual $\mathrm{res}(y)$ by the contributions to the $\ell_1$ weight by
$\alpha x_1, \ldots, \alpha^{\log n - 1} x_{\log n - 1}$. The sum of these contributions is $k \sum_{j=1}^{\log n - 1} \alpha^j$, which we
already bounded above in (3.4). In light of the recovery guarantee (3.2), we have

$$\|\hat{y} - y\|_1 \le c \cdot \mathrm{res}(y) \le 0.01 \cdot k\alpha^{\log n}. \qquad (3.5)$$

(iii) Let $x, x' \in X$ be distinct. By the definition of $X$, $d_H(x, x') \ge 0.1k$. Since $x, x'$ are
both 0-1 vectors,

$$\|\alpha^{\log n} x - \alpha^{\log n} x'\|_1 \ge 0.1 \cdot k\alpha^{\log n}. \qquad (3.6)$$

Combining (3.4) and (3.5) with the triangle inequality (and using $c \ge 1$), we have

$$\|\hat{y} - \alpha^{\log n} x_{\log n}\|_1 \le 0.02 \cdot k\alpha^{\log n}. \qquad (3.7)$$

Meanwhile, for every other $x \in X$, combining (3.6) and (3.7) with the triangle inequality
gives

$$\|\hat{y} - \alpha^{\log n} x\|_1 \ge 0.08 \cdot k\alpha^{\log n}. \qquad (3.8)$$

Inequalities (3.7) and (3.8) imply that Bob’s nearest-neighbor computation will indeed
recover $x_{\log n}$.

You’d be right to regard the analysis so far with skepticism. The same reason that Bob
can recover $x_{\log n}$ (because $y$ is just a scaled version of $x_{\log n}$, plus some noise) suggests
that Bob should not be able to recover the other $x_j$’s, and hence unable to solve INDEX for
indices $i$ outside of the last block of Alice’s input.

The key observation is that, after recovering $x_{\log n}$, Bob can “subtract it out” without
any further communication from Alice and then recover $x_{\log n - 1}$. Iterating this argument
allows Bob to reconstruct all of $x_1, \ldots, x_{\log n}$ and hence solve INDEX, no matter what his
index $i$ is.

In more detail, suppose Bob has already reconstructed $x_{j+1}, \ldots, x_{\log n}$. He’s then in a
position to form

$$z = \alpha^{j+1} x_{j+1} + \cdots + \alpha^{\log n} x_{\log n}. \qquad (3.9)$$

Then, $y - z$ equals the first $j$ contributors to $y$ (subtracting $z$ undoes the last $\log n - j$ of
them) and is therefore just a scaled version of $x_j$, plus some relatively small noise (the
scaled $x_i$’s with $i < j$). This raises the hope that Bob can recover a scaled version of $x_j$
from $A(y - z)$. How can Bob get his hands on the latter vector? (He doesn’t know $y$, and
we don’t want Alice to send any more bits to Bob.) The trick is to use the linearity of $A$:
Bob knows $A$ and $z$ and hence can compute $Az$, Alice has already sent him $Ay$, so Bob
just computes $Ay - Az = A(y - z)$!

After computing $A(y - z)$, Bob invokes the recovery algorithm $R$ to obtain a vector
$w \in \mathbb{R}^n$ that satisfies

$$\|w - (y - z)\|_1 \le c \cdot \mathrm{res}(y - z),$$

and computes (by brute force) the vector $x \in X$ minimizing $\|w - \alpha^j x\|_1$. The minimizing
vector is $x_j$; the reader should check that the proof of this is word-for-word the same
as our recovery proof for $x_{\log n}$, with every “$\log n$” replaced by “$j$,” every “$y$” replaced by
“$y - z$,” and every “$\hat{y}$” replaced by “$w$.”

This completes the implementation of step (4) of the protocol, and hence of the reduction
from INDEX to the design of a sensing matrix $A$ and recovery algorithm $R$ that satisfy (3.2).
We conclude that $A$ must have $m = \Omega(k \log \frac{n}{k})$, which completes the proof of Theorem 3.4.
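
Bob’s peel-off decoding can be simulated end to end, as a sanity check on the analysis. The sketch below makes one loud simplification: there is no sensing matrix at all. Instead, a stand-in `mock_recover` is handed $y - z$ directly and returns it corrupted by noise of $\ell_1$ norm exactly $c \cdot \mathrm{res}(y - z)$, the most that (3.2) allows. In the real protocol Bob only sees $Ay - Az$. All names, the small family of vectors, and the parameters are assumptions of this example.

```python
import random


def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def residual(v, k):
    """l1 mass of v outside its k largest-magnitude coordinates."""
    return sum(sorted((abs(t) for t in v), reverse=True)[k:])


def mock_recover(v, k, c, rng):
    """Stand-in for R: return some w with ||w - v||_1 <= c * res(v).
    Here all of the permitted error is dumped on one random coordinate."""
    w = list(v)
    w[rng.randrange(len(v))] += rng.choice((-1, 1)) * c * residual(v, k)
    return w


def alice_encode(blocks, alpha):
    """y = sum_j alpha^j * x_j, with j starting at 1, as in (3.3)."""
    n = len(blocks[0])
    return [sum(alpha ** j * x[t] for j, x in enumerate(blocks, start=1))
            for t in range(n)]


def bob_decode(y, family, num_blocks, k, c, alpha, rng):
    """Recover x_L, then x_{L-1}, ..., subtracting each one out as in (3.9)."""
    z = [0] * len(y)
    decoded = [None] * num_blocks
    for j in range(num_blocks, 0, -1):
        # Simplification: Bob really holds A(y - z) = Ay - Az, not y - z.
        remainder = [a - b for a, b in zip(y, z)]
        w = mock_recover(remainder, k, c, rng)
        scale = alpha ** j
        # Brute-force nearest neighbor among the scaled vectors of the family.
        best = min(family, key=lambda x: l1_distance(w, [scale * t for t in x]))
        decoded[j - 1] = best
        z = [zt + scale * bt for zt, bt in zip(z, best)]
    return decoded


if __name__ == "__main__":
    n, k, c = 40, 10, 2
    alpha = max(2, 200 * c)
    # Seven vectors with k ones each and pairwise Hamming distance >= k.
    family = [[1 if s <= t < s + k else 0 for t in range(n)]
              for s in range(0, n - k + 1, 5)]
    blocks = [family[3], family[0], family[6], family[0], family[2]]
    y = alice_encode(blocks, alpha)
    decoded = bob_decode(y, family, len(blocks), k, c, alpha, random.Random(7))
    print("all blocks recovered:", decoded == blocks)
```

The choice of $\alpha = \max\{2, 200c\}$ mirrors the assumption in the analysis; with a much smaller $\alpha$ the noise budget can swamp the separation between scaled vectors, and the nearest-neighbor step may fail.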


3.3.4 Lower Bounds for Randomized Recovery 


We proved the lower bound in Theorem 3.4 only for fixed matrices $A$ and deterministic
recovery algorithms $R$. This is arguably the most relevant case, but it’s also worth asking
whether or not better positive results (i.e., fewer rows) are possible for a randomized
recovery requirement, where recovery can fail with constant probability. Superficially, the
randomization can come from two sources: first, one can use a distribution over matrices $A$;
second, the recovery algorithm (given $Ax$) can be randomized. Since we’re not worrying about
computational efficiency, we can assume without loss of generality that $R$ is deterministic:
a randomized recovery algorithm can be derandomized just by enumerating over all of its
possible executions and taking the majority vote.

Formally, the relaxed requirement for a positive result is: there exists a constant $c \ge 1$,
a distribution $D$ over $m \times n$ matrices $A$, and a (deterministic) recovery algorithm $R$ such
that, for every $x \in \mathbb{R}^n$, with probability at least 2/3 (over the choice of $A$), $R$ returns a
vector $x' \in \mathbb{R}^n$ that satisfies

$$\|x' - x\|_1 \le c \cdot \mathrm{res}(x).$$

The lower bound in Theorem 3.4 applies even to such randomized solutions. The obvious
idea is to follow the proof of Theorem 3.4 to show that a randomized recovery guarantee
yields a randomized protocol for INDEX. Since even randomized protocols for the latter
problem require linear communication, this would imply the desired lower bound.

The first attempt at modifying the proof of Theorem 3.4 has Alice and Bob using the
public coins to pick a sensing matrix $A$ at random from the assumed distribution; thus, $A$
is known to both Alice and Bob with no communication. Given the choice of $A$, Alice sends
$Ay$ to Bob as before and Bob runs the assumed recovery algorithm $R$. With probability at
least 2/3, the result is a vector $\hat{y}$ from which Bob can recover $x_{\log n}$. The issue is that Bob
has to run $R$ on $\log n$ different inputs, once to recover each of $x_1, \ldots, x_{\log n}$, and there is a
failure probability of 1/3 each time.

The obvious fix is to reduce the failure probability by independent trials. So Alice and
Bob use the public coins to pick $\ell = \Theta(\log \log n)$ matrices $A^1, \ldots, A^\ell$ independently from
the assumed distribution. Alice sends $A^1 y, \ldots, A^\ell y$ to Bob, and Bob runs the recovery
algorithm $R$ on each of them and computes the corresponding candidate vectors $\tilde{x}_1, \ldots, \tilde{x}_\ell \in X$. Except
with probability $O(1/\log n)$, a majority of the $\tilde{x}_j$’s will be the vector $x_{\log n}$. By a Union
Bound over the $\log n$ iterations of Bob’s recovery algorithm, Bob successfully reconstructs
each of $x_1, \ldots, x_{\log n}$ with probability at least 2/3, completing the randomized protocol for
INDEX. This protocol has communication cost $O(m \log n \log \log n)$ and the lower bound
for INDEX is $\Omega(\log |X| \log n)$, which yields a lower bound of $m = \Omega(k \log \frac{n}{k} / \log \log n)$ for
randomized recovery.
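
The amplification step can be checked numerically. The sketch below computes the exact probability that a majority vote over independent trials fails when each trial fails with probability 1/3, and then searches for the number of trials needed so that a Union Bound over a given number of rounds stays below 1/3. The function names and the sample round counts are assumptions of this example; treating a tie as a failure is a conservative choice.

```python
from math import comb


def majority_failure_probability(trials, p_fail):
    """Probability that at least half of `trials` independent runs fail,
    when each run fails independently with probability p_fail."""
    threshold = (trials + 1) // 2  # for even `trials`, a tie counts as failure
    return sum(
        comb(trials, f) * p_fail ** f * (1 - p_fail) ** (trials - f)
        for f in range(threshold, trials + 1)
    )


def trials_needed(num_rounds, p_fail=1 / 3, overall_failure=1 / 3):
    """Smallest odd number of trials per round for which a Union Bound over
    num_rounds rounds gives overall failure probability <= overall_failure."""
    trials = 1
    while num_rounds * majority_failure_probability(trials, p_fail) > overall_failure:
        trials += 2
    return trials


if __name__ == "__main__":
    for rounds in (10, 20, 40, 80):  # `rounds` plays the role of log n
        print(rounds, "rounds ->", trials_needed(rounds), "trials per round")
```

Because the majority’s failure probability decays exponentially in the number of trials (by a Chernoff bound), the required number of trials grows only logarithmically in the number of rounds, i.e., like $\log \log n$.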

The reduction in the proof of Theorem 3.4 can be modified in a different way to avoid the
$\log \log n$ factor and prove the same lower bound of $m = \Omega(k \log \frac{n}{k})$ that we established for
deterministic recovery. The trick is to modify the problem being reduced from (previously
INDEX) so that it becomes easier, and therefore solvable assuming only a randomized
recovery guarantee, subject to its randomized communication complexity remaining linear.

The modified problem is called AUGMENTED INDEX; it’s a contrived problem but has
proved technically convenient in several applications. Alice gets an input $x \in \{0,1\}^\ell$. Bob
gets an index $i \in \{1, 2, \ldots, \ell\}$ and also the subsequent bits $x_{i+1}, \ldots, x_\ell$ of Alice’s input. This
problem is obviously only easier than INDEX, but it’s easy to show that its deterministic
one-way communication complexity is $\ell$ (see the Exercises). With some work, it can be
shown that its randomized one-way communication complexity is $\Omega(\ell)$ (see Bar-Yossef et al.
(2004); Do Ba et al. (2010); Miltersen et al. (1998)).


The reduction in Theorem 3.4 is easily adapted to show that a randomized approximate
sparse recovery guarantee with matrices with $m$ rows yields a randomized one-way
communication protocol for AUGMENTED INDEX with $(\log |X| \log n)$-bit inputs with
communication cost $O(m \log n)$ (and hence $m = \Omega(\log |X|) = \Omega(k \log \frac{n}{k})$). We interpret Alice’s
input in the same way as before. Alice and Bob use the public coins to pick a matrix
$A$ from the assumed distribution and Alice sends $Ay$ to Bob. Bob is given an index
$i \in \{1, 2, \ldots, \log |X| \log n\}$ that belongs to some block $j$. Bob is also given all bits of Alice’s
input after the $i$th one. These bits include $\mathrm{enc}(x_{j+1}), \ldots, \mathrm{enc}(x_{\log n})$, so Bob can simply
compute $z$ as in (3.9) (with no error) and invoke the recovery algorithm $R$ (once). Whenever
$A$ is such that the guarantee (3.2) holds for $y - z$, Bob will successfully reconstruct $x_j$ and
therefore compute the correct answer to AUGMENTED INDEX.


3.3.5 Digression 

One could ask if communication complexity is “really needed” for this proof of Theorem 3.4.
Churlish observers might complain that, due to the nature of communication complexity
lower bounds (like those last lecture), this proof of Theorem 3.4 is “just counting.” While
not wrong, this attitude is counterproductive. The fact is that adopting the language and
mindset of communication complexity has permitted researchers to prove results that had
previously eluded many smart people; in this case, the ends justify the means.

The biggest advantage of using the language of communication complexity is that
one naturally thinks in terms of reductions between different lower bounds. Reductions
can repurpose a single counting argument, like our lower bound for INDEX, for lots of
different problems. Many of the more important lower bounds derived from communication
complexity, including today’s main result, involve quite clever reductions, and it would be
difficult to devise from scratch the corresponding counting arguments.

Footnote (on “just counting”): Critics said (and perhaps still say) the same thing about the probabilistic method (Alon and Spencer 2008).

Footnote (on reductions between lower bounds): Thinking in terms of reductions seems to be a “competitive advantage” of theoretical computer scientists:
there are several examples where a reduction-based approach yielded new progress on old and seemingly
intractable mathematical problems.



Lecture 4 


Boot Camp on Communication Complexity 


4.1 Preamble 


This lecture covers the most important basic facts about deterministic and randomized 
communication protocols in the general two-party model, as defined by Yao (1979). Some
version of this lecture would normally be the first lecture in a course on communication 
complexity. How come it’s the fourth one here? 


The first three lectures were about one-way communication complexity — communication 
protocols where there is only one message, from Alice to Bob — and its applications. One 
reason we started with the one-way model is that several of the “greatest hits” of algorithmic 
lower bounds via communication complexity, such as space lower bounds for streaming 
algorithms and row lower bounds for compressive sensing matrices, already follow from 
communication lower bounds for one-way protocols. A second reason is that considering 
only one-way protocols is a gentle introduction to what communication protocols look like. 
There are already some non-trivial one-way protocols, like our randomized protocol for 
Equality. On the other hand, proving lower bounds for one-way protocols is much easier 
than proving them for general protocols, so it’s also a good introduction to lower bound 
proofs. 


The rest of our algorithmic applications require stronger lower bounds that apply to 
more than just one-way protocols. This lecture gives a “boot camp” on the basic model. 
We won’t say much about applications in this lecture, but the final five lectures all focus 
on applications. We won’t prove any hard results today, and focus instead on definitions 
and vocabulary, examples, and some easy results. One point of today’s lecture is to get 
a feel for what’s involved in proving a communication lower bound for general protocols. 
It generally boils down to a conceptually simple, if sometimes mathematically challenging,
combinatorial problem: proving that a large number of “rectangles” of a certain type are
needed to cover a matrix.

Footnote (on covering arguments): There are many other methods for proving communication lower bounds, some quite deep and exotic
(see e.g. Lee and Shraibman (2009)), but all of our algorithmic applications can ultimately be derived from
combinatorial covering-type arguments. For example, we’re not even going to mention the famous “rank
lower bound.” For your edification, some other lower bound methods are discussed in the Exercises.


4.2 Deterministic Protocols
4.2.1 Protocols

We are still in the two-party model, where Alice has an input $x \in X$ unknown to Bob,
and Bob has an input $y \in Y$ unknown to Alice. (Most commonly, $X = Y = \{0,1\}^n$.) A
deterministic communication protocol specifies, as a function of the messages sent so far,
whose turn it is to speak. A protocol always specifies when the communication has ended
and, in each end state, the value of the computed bit. Alice and Bob can coordinate in
advance to decide upon the protocol, and both are assumed to cooperate fully. The only
constraint faced by the players is that what a player says can depend only on what the
player knows: his or her own input, and the history of all messages sent so far.

Like with one-way protocols, we define the cost of a protocol as the maximum number 
of bits it ever sends, ranging over all inputs. The communication complexity of a function is 
then the minimum communication cost of a protocol that correctly computes it. 

The key feature of general communication protocols absent from the special case of 
one-way protocols is interaction between the two players. Intuitively, interaction should 
allow the players to communicate much more efficiently. Let’s see this in a concrete example. 


4.2.2 Example: Clique-Independent Set 


The following problem might seem contrived, but it is fairly central in communication
complexity. There is a graph $G = (V, E)$ with $|V| = n$ that is known to both players. Alice’s
private input is a clique $C$ of $G$, i.e., a subset of vertices such that $(u, v) \in E$ for every distinct
$u, v \in C$. Bob’s private input is an independent set $I$ of $G$, i.e., a subset of vertices such that
$(u, v) \notin E$ for every distinct $u, v \in I$. (There is no requirement that $C$ or $I$ is maximal.)
Observe that $C$ and $I$ are either disjoint, or they intersect in a single vertex (Figure 4.1).
The players’ goal is to figure out which of these two possibilities is the case. Thus, this
problem is a special case of DISJOINTNESS, where players’ sets are not arbitrary but rather
a clique and an independent set from a known graph.

The naive communication protocol for solving the problem uses $O(n)$ bits: Alice can
send the characteristic vector of $C$ to Bob, or Bob the characteristic vector of $I$ to Alice,
and then the other player computes the correct answer. Since the number of cliques and
independent sets of a graph is generally exponential in the number $n$ of vertices, this protocol
cannot be made significantly more communication-efficient via a smarter encoding. An
easy reduction from INDEX shows that one-way protocols, including randomized protocols,
require $\Omega(n)$ communication (exercise).

Figure 4.1 A clique C and an independent set I overlap in zero or one vertices.

The players can do much better by interacting. Here is the protocol.


1. If there is a vertex v ∈ C with deg(v) < n/2, then Alice sends the name of an arbitrary
such vertex to Bob (≈ log_2 n bits).


a) Bob announces whether or not v ∈ I (1 bit). If so, the protocol terminates with
conclusion “not disjoint.”

b) Otherwise, Alice and Bob recurse on the subgraph H induced by v and its 
neighbors. 

[Note: H contains at most half the nodes of G. It contains all of C and, if I 
intersects C, it contains the vertex in their intersection. C and I intersect in G 
if and only if their projections to H intersect in H.] 

2. Otherwise, Alice sends a “NULL” message to Bob (≈ log_2 n bits).

3. If there is a vertex v ∈ I with deg(v) ≥ n/2, then Bob sends the name of an arbitrary
such vertex to Alice (≈ log_2 n bits).

a) Alice announces whether or not v ∈ C (1 bit). If so, the protocol terminates
(“not disjoint”).

b) If not, Alice and Bob recurse on the subgraph H induced by v and its non¬ 
neighbors. 

[Note: H contains at most half the nodes of G. It contains all of I and, if C
intersects I, it contains the vertex in their intersection. Thus the function’s
answer in H is the same as that in G.]

4. Otherwise, Bob terminates the protocol and declares “disjoint.” 

[Disjointness is obvious since, at this point in the protocol, we know that deg(v) ≥ n/2
for all v ∈ C and deg(v) < n/2 for all v ∈ I.]

Since each iteration of the protocol uses O(log n) bits of communication and cuts the
number of vertices of the graph in half (or terminates), the total communication is O(log^2 n).
As previously noted, such a result is impossible without interaction between the players.
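The protocol above can be simulated directly. The following is a minimal sketch, not a definitive implementation: the function name, the adjacency-dictionary representation, and the bit accounting (ceil(log_2(n + 1)) bits per vertex name or NULL message) are assumptions made for illustration. One small deviation is labeled in the comments: after a vertex v has been checked and ruled out, the sketch drops v itself from the subgraph, so that the vertex set strictly shrinks and the loop always terminates.

```python
import math

def clique_vs_independent_set(adj, clique, indep):
    """Simulate the interactive protocol; return (answer, bits_sent).

    adj:    dict mapping each vertex to the set of its neighbors (no self-loops)
    clique: Alice's clique C (a set); indep: Bob's independent set I (a set)
    """
    vertices = set(adj)
    bits = 0
    while vertices:
        n = len(vertices)
        name_bits = math.ceil(math.log2(n + 1))  # n vertex names plus NULL
        deg = {v: len(adj[v] & vertices) for v in vertices}

        # Steps 1-2: Alice names a vertex of C with degree < n/2, or sends NULL.
        low = sorted(v for v in clique & vertices if deg[v] < n / 2)
        bits += name_bits
        if low:
            v = low[0]
            bits += 1  # Bob announces whether v is in I
            if v in indep:
                return "not disjoint", bits
            # Recurse on v's neighbors (deviation: v itself is dropped).
            vertices &= adj[v]
            continue

        # Steps 3-4: Bob names a vertex of I with degree >= n/2, or stops.
        high = sorted(v for v in indep & vertices if deg[v] >= n / 2)
        bits += name_bits
        if not high:
            return "disjoint", bits
        v = high[0]
        bits += 1  # Alice announces whether v is in C
        if v in clique:
            return "not disjoint", bits
        # Recurse on v's non-neighbors (deviation: v itself is dropped).
        vertices -= adj[v] | {v}
    return "disjoint", bits
```

Each pass through the loop costs O(log n) bits and leaves fewer than half of the vertices, which mirrors the O(log^2 n) accounting above.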


4.2.3 Trees and Matrices 

The Clique-Independent Set problem clearly demonstrates that we need new lower 
bound techniques to handle general communication protocols: the straightforward
Pigeonhole Principle arguments that worked for one-way protocols are not going to be good
enough. At first blush this might seem intimidating — communication protocols can do all 
sorts of crazy things, so how can we reason about them in a principled way? How can we 
connect properties of a protocol to the function that it computes? Happily, we can quickly 
build up some powerful machinery for answering these questions. 

First, we observe that deterministic communication protocols are really just binary trees. 
We’ll almost never use this fact directly, but it should build some confidence that protocols 
are familiar and elementary mathematical objects. 

The connection is easiest to see by example; see Figure 4.2. Consider the following
protocol for solving EQUALITY with n = 2 (i.e., f(x, y) = 1 if and only if x = y). Alice
begins by sending her first bit. If Bob’s first bit is different, he terminates the protocol and
announces “not equal.” If Bob’s first bit is the same, then he transmits the same bit back.
In this case, Alice then sends her second bit. At this point, Bob knows Alice’s whole input
and can therefore compute the correct answer.

In Figure 4.2, each node corresponds to a possible state of the protocol, and is labeled
with the player whose turn it is to speak. Thus the labels alternate with the levels, with
the root belonging to Alice. There are 10 leaves, representing the possible end states of
the protocol. There are two leaves for the case where Alice and Bob have different first
bits and the protocol terminates early, and eight leaves for the remaining cases where Alice
and Bob have the same first bit. Note that the possible transcripts of the protocol are
in one-to-one correspondence with the root-to-leaf paths of the tree, so we use leaves and
transcripts interchangeably below.

We can view the leaves as a partition {Z(ℓ)} of the input space X × Y, with Z(ℓ) the
inputs (x, y) such that the protocol terminates in the leaf ℓ. In our example, there are 10
leaves for the 16 possible inputs (x, y), so different inputs can generate the same transcript
(more on this shortly).

Next note that we can represent a function f (from (x, y) to {0,1}) as a matrix M(f). In
contrast to the visualization exercise above, we’ll use this matrix representation all the time.
The rows are labeled with the set X of possible inputs of Alice, the columns with the set Y of
possible inputs of Bob. Entry (x, y) of the matrix is f(x, y). Keep in mind that this matrix
is fully known to both Alice and Bob when they agree on a protocol.
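To make the matrix view concrete, here is a small sketch that builds M(f) for X = Y = {0,1}^2. The helper names (function_matrix, equality, disjointness) are illustrative assumptions, not notation from these notes. For EQUALITY the result is the 4 × 4 identity matrix.

```python
from itertools import product

def function_matrix(f, n):
    """Return (inputs, M) with M[i][j] = f(inputs[i], inputs[j]).

    Rows are indexed by Alice's input x, columns by Bob's input y.
    """
    inputs = ["".join(bits) for bits in product("01", repeat=n)]
    return inputs, [[f(x, y) for y in inputs] for x in inputs]

def equality(x, y):
    return int(x == y)

def disjointness(x, y):
    # 1 iff there is no coordinate i with x_i = y_i = 1
    return int(all(not (a == "1" and b == "1") for a, b in zip(x, y)))

inputs, M_eq = function_matrix(equality, 2)
_, M_disj = function_matrix(disjointness, 2)

for label, row in zip(inputs, M_eq):
    print(label, row)
```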


[Footnote: In general, players need not alternate turns in a communication protocol.]

We also write out the matrix for DISJOINTNESS, which is somewhat more inscrutable:

          00  01  10  11
    00  (  1   1   1   1  )
    01  (  1   0   1   0  )
    10  (  1   1   0   0  )
    11  (  1   0   0   0  )


4.2.4 Protocols and Rectangles 

How can we reason about the behavior of a protocol? Just visualizing them as trees is not 
directly useful. We know that simple Pigeonhole Principle-based arguments are not strong 
enough, but it still feels like we want some kind of counting argument. 


To see what might be true, let’s run the 2-bit EQUALITY protocol depicted in Figure 4.2
and track its progress using the matrix in (4.1). Put yourself in the shoes of an outside
observer, who knows neither x nor y, and makes inferences about (x, y) as the protocol
proceeds. When the protocol terminates, we’ll have carved up the matrix into 10 pieces,
one for each leaf of the protocol tree. The protocol transcript reveals the leaf to an outside
observer, but nothing more.

Before the protocol begins, all 16 inputs are fair game. After Alice sends her first bit,
the outside observer can narrow down the possible inputs into a set of 8: the top 8 if Alice
sent a 0, the bottom 8 if she sent a 1. The next bit sent gives away whether Bob’s first
bit is a 0 or 1, so the outside observer learns which quadrant the input lies in. Interestingly,
in the northeastern and southwestern quadrants, all of the entries are 0. In these cases,
even though ambiguity remains about exactly what the input (x, y) is, the function’s value
f(x, y) has been determined (it is 0, whatever the input). It’s no coincidence that these
two regions correspond to the two leaves of the protocol in Figure 4.2 that stop early, with
the correct answer. If the protocol continues further, then Alice’s second bit splits the
northwestern and southeastern quadrants into two, and Bob’s final bit splits them again,
now into singleton regions. In these cases, an outside observer learns the entire input (x, y)
from the protocol’s transcript.

What have we learned? We already knew that every protocol induces a partition of the
input space X × Y, with one set for each leaf or, equivalently, for each distinct transcript.
At least for the particular protocol that we just studied, each of the sets has a particularly
nice submatrix form (Figure 4.3). This is true in general, in the following sense.


[Footnote: It’s also interesting to do an analogous thought experiment from the perspective
of one of the players. For example, consider Bob’s perspective when the input is (00, 01).
Initially Bob knows that the input lies in the second column but is unsure of the row. After
Alice’s first message, Bob knows that the input is in the second column and one of the first
two rows. Bob still cannot be sure about the correct answer, so the protocol proceeds.]

Figure 4.3 The partition of the input space X × Y according to the 10 different transcripts that
can be generated by the EQUALITY protocol.


Lemma 4.1 (Rectangles) For every transcript z of a deterministic protocol P, the set of
inputs (x, y) that generate z is a rectangle, of the form A × B for A ⊆ X and B ⊆ Y.


A rectangle just means a subset of the input space X × Y that can be written as
a product. For example, the set {(00, 00), (11, 11)} is not a rectangle, while the set
{(00, 00), (11, 00), (00, 11), (11, 11)} is. In general, a subset S ⊆ X × Y is a rectangle
if and only if it is closed under “mix and match,” meaning that whenever (x_1, y_1) and
(x_2, y_2) are in S, so are (x_1, y_2) and (x_2, y_1) (see the Exercises).
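Both characterizations of a rectangle are easy to check mechanically on small examples. The sketch below uses two hypothetical helper functions (is_rectangle and closed_under_mix_and_match, names chosen here for illustration) and tests them on the two sets from the text.

```python
def is_rectangle(S):
    """True iff the set S of (x, y) pairs equals A x B for some A and B."""
    S = set(S)
    A = {x for x, _ in S}  # projection onto Alice's inputs
    B = {y for _, y in S}  # projection onto Bob's inputs
    return S == {(x, y) for x in A for y in B}

def closed_under_mix_and_match(S):
    """True iff (x1, y1), (x2, y2) in S always implies (x1, y2) in S."""
    S = set(S)
    return all((x1, y2) in S for (x1, _) in S for (_, y2) in S)

diagonal = {("00", "00"), ("11", "11")}
square = {("00", "00"), ("11", "00"), ("00", "11"), ("11", "11")}

print(is_rectangle(diagonal), closed_under_mix_and_match(diagonal))
print(is_rectangle(square), closed_under_mix_and_match(square))
```

Note that neither check ever looks at an ordering of X or Y, which is the point of calling these rectangles “combinatorial.”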

Don’t be misled by our example (Figure 4.3), where the rectangles induced by our
protocol happen to be “contiguous.” For example, if we keep the protocol the same but
switch the order in which we write down the rows and columns corresponding to 01 and 10, 
we get an analogous decomposition in which the two large rectangles are not contiguous. In 
general, you shouldn’t even think of X and Y as ordered sets. Rectangles are sometimes 
called combinatorial rectangles to distinguish them from “geometric” rectangles and to 
emphasize this point. 

Lemma 4.1 is extremely important, though its proof is straightforward — we just follow 
the protocol like in our example above. Intuitively, each step of a protocol allows an outside 
observer to narrow down the possibilities for x while leaving the possibilities for y unchanged 
(if Alice speaks) or vice versa (if Bob speaks). 


Proof of Lemma 4.1: Fix a deterministic protocol P. We proceed by induction on the number
of bits exchanged. For the base case, all inputs X × Y begin with the empty transcript. For
the inductive step, consider an arbitrary t-bit transcript-so-far z generated by P, with t ≥ 1.
Assume that Alice was the most recent player to speak; the other case is analogous. Let
z' denote z with the final bit b ∈ {0,1} lopped off. By the inductive hypothesis, the set of
inputs that generate z' has the form A × B. Let A_b ⊆ A denote the inputs x ∈ A such that,
in the protocol P, Alice sends the bit b given the transcript z'. (Recall that the message
sent by a player is a function only of his or her private input and the history of the protocol
so far.) Then the set of inputs that generate z is A_b × B, completing the inductive step. ■


Note that Lemma 4.1 makes no reference to a function f: it holds for any deterministic
protocol, whether or not it computes a function that we care about. In Figure 4.3 we can
clearly see an additional property of all of the rectangles: with respect to the matrix
in (4.1), every rectangle is monochromatic, meaning all of its entries have the same value.
This is true for any protocol that correctly computes a function f.


Lemma 4.2 If a deterministic protocol P computes a function f, then every rectangle
induced by P is monochromatic in the matrix M(f).

Proof: Consider an arbitrary combinatorial rectangle A × B induced by P, with all inputs
in A × B inducing the same transcript. The output of P is constant on A × B. Since P
correctly computes f, f is also constant on A × B. ■
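Lemmas 4.1 and 4.2 can be checked by brute force for the 2-bit EQUALITY protocol described earlier. The sketch below is an illustration under stated assumptions: the transcript encoding (a string of the bits sent, with Bob's last bit being the answer) and the function names are choices made here, not notation from the notes.

```python
from collections import defaultdict
from itertools import product

def equality_transcript(x, y):
    """Transcript and output of the 2-bit EQUALITY protocol on inputs x, y."""
    transcript = x[0] + y[0]      # Alice's first bit, then Bob's first bit
    if x[0] != y[0]:
        return transcript, 0      # Bob stops early: "not equal"
    answer = int(x[1] == y[1])
    return transcript + x[1] + str(answer), answer

def is_rectangle(S):
    A = {x for x, _ in S}
    B = {y for _, y in S}
    return set(S) == {(x, y) for x in A for y in B}

inputs = ["".join(bits) for bits in product("01", repeat=2)]
leaves = defaultdict(set)         # transcript -> set of inputs generating it
for x, y in product(inputs, repeat=2):
    transcript, _ = equality_transcript(x, y)
    leaves[transcript].add((x, y))

for transcript, S in sorted(leaves.items()):
    values = {int(x == y) for x, y in S}
    print(transcript, sorted(S), is_rectangle(S), values)
```

Grouping the 16 inputs by transcript yields one set per leaf; each set is a product set on which EQUALITY takes a single value.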

Amazingly, the minimal work we’ve invested so far already yields a powerful technique 
for lower bounding the deterministic communication complexity of functions. 


Theorem 4.3 Let f be a function such that every partition of M(f) into monochromatic
rectangles requires at least t rectangles. Then the deterministic communication complexity of
f is at least log_2 t.


Proof: A deterministic protocol with communication cost c can only generate 2^c distinct
transcripts; equivalently, its (binary) protocol tree can only have 2^c leaves. If such a
protocol computes the function f, then by Lemmas 4.1 and 4.2 it partitions M(f) into at
most 2^c monochromatic rectangles. By assumption, 2^c ≥ t and hence c ≥ log_2 t. ■


Rather than applying Theorem 4.3 directly, we’ll almost always be able to prove a
stronger and simpler condition. To partition a matrix, one needs to cover all of its entries
with disjoint sets. The disjointness condition is annoying. So by a covering of a 0-1 matrix,
we mean a collection of subsets of entries whose union includes all of its elements, where
overlaps between these sets are allowed. See Figure 4.4. Since every partition is in particular
a covering, a lower bound on the size of coverings is also a lower bound on the size of partitions.


Corollary 4.4 Let f be a function such that every covering of M(f) by monochromatic
rectangles requires at least t rectangles. Then the deterministic communication complexity of
f is at least log_2 t.

Communication complexity lower bounds proved using covers (including all of those
proved in Section 4.2.5) automatically apply also to more general “nondeterministic”
communication protocols, as well as randomized protocols with 1-sided error. We’ll discuss
this more next lecture, when it will be relevant.

Figure 4.4 A covering by four monochromatic rectangles that is not a partition.

4.2.5 Lower Bounds for Equality and Disjointness 

Armed with Corollary 4.4, we can quickly prove communication lower bounds for some
functions of interest. For example, recall that when f is the EQUALITY function, the matrix
M(f) is the identity. The key observation about this matrix is: a monochromatic rectangle
that includes a “1” contains only one element. The reason is simple: such a rectangle is not
allowed to contain any 0’s since it is monochromatic, and if it included a second 1 it would
pick up some 0-entries as well (recall that rectangles are closed under “mix and match”).
Since there are 2^n 1’s in the matrix, every covering by monochromatic rectangles (even of
just the 1’s) has size at least 2^n.

Corollary 4.5 The deterministic communication complexity of EQUALITY is at least n.

The exact same argument gives the same lower bound for the Greater-Than function.

Corollary 4.6 The deterministic communication complexity of Greater-Than is at least n.


We can generalize this argument as follows. A fooling set for a function f is a subset
F ⊆ X × Y of inputs such that:

(i) f is constant on F;

(ii) for each distinct pair (x_1, y_1), (x_2, y_2) ∈ F, at least one of (x_1, y_2), (x_2, y_1) has the
opposite f-value.

Since rectangles are closed under the “mix and match” operation, (i) and (ii) imply that
every monochromatic rectangle contains at most one element of F.

[Footnote: The 0s can be covered using another 2^n monochromatic rectangles, one per row
(rectangles need not be “contiguous”!). This gives a lower bound of n + 1. The trivial upper
bound has Alice sending her input to Bob and Bob announcing the answer, which is an
(n + 1)-bit protocol. Analogous “+1” improvements are possible for the other examples in
this section.]

Corollary 4.7 If F is a fooling set for f, then the deterministic communication complexity
of f is at least log_2 |F|.

For Equality and Greater-Than, we were effectively using the fooling set
F = {(x, x) : x ∈ {0,1}^n}.

The fooling set method is powerful enough to prove a strong lower bound on the
deterministic communication complexity of DISJOINTNESS.


Corollary 4.8 The deterministic communication complexity of DISJOINTNESS is at least n.


Proof: Take F = {(x, 1 − x) : x ∈ {0,1}^n} or, in set notation,
{(S, S^c) : S ⊆ {1, 2, ..., n}}. The set F is a fooling set: it obviously consists only of “yes”
inputs of DISJOINTNESS, while for every S ≠ T, either S ∩ T^c ≠ ∅ or T ∩ S^c ≠ ∅ (or both).
See Figure 4.5. Since |F| = 2^n, Corollary 4.7 completes the proof. ■



Figure 4.5 If S and T are different sets, then either S and T^c or T and S^c are not disjoint.
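The two fooling set conditions are simple enough to verify exhaustively for small n. The following sketch checks the fooling sets used above for n = 3; the function names and the bit-string encoding of inputs are assumptions made for illustration.

```python
from itertools import product

def is_fooling_set(f, F):
    """Check conditions (i) and (ii) for a list F of (x, y) pairs."""
    F = list(F)
    if len({f(x, y) for x, y in F}) > 1:
        return False                      # (i) fails: f is not constant on F
    for i, (x1, y1) in enumerate(F):
        b = f(x1, y1)
        for x2, y2 in F[i + 1:]:
            if f(x1, y2) == b and f(x2, y1) == b:
                return False              # (ii) fails for this pair
    return True

def equality(x, y):
    return int(x == y)

def greater_than(x, y):
    return int(int(x, 2) > int(y, 2))

def disjointness(x, y):
    return int(all(not (a == "1" and b == "1") for a, b in zip(x, y)))

def complement(x):
    return "".join("1" if c == "0" else "0" for c in x)

n = 3
inputs = ["".join(bits) for bits in product("01", repeat=n)]
diagonal = [(x, x) for x in inputs]
complements = [(x, complement(x)) for x in inputs]

print(is_fooling_set(equality, diagonal))
print(is_fooling_set(greater_than, diagonal))
print(is_fooling_set(disjointness, complements))
```

Each of these fooling sets has 2^n elements, so Corollary 4.7 gives a lower bound of n in all three cases.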


4.2.6 Take-Aways 

A key take-away point from this section is that, using covering arguments, we can prove 
the lower bounds that we want on the deterministic communication complexity of many 
functions of interest. These lower bounds apply also to nondeterministic protocols (discussed 
next week) and randomized protocols with 1-sided error. 

As with one-way communication complexity, proving stronger lower bounds that apply
also to randomized protocols with two-sided error is more challenging. Since we’re usually
perfectly happy with a good randomized algorithm (recall the F_2 estimation algorithm
from Section 1.4), such lower bounds are very relevant for algorithmic applications. They
are our next topic.


4.3 Randomized Protocols 


4.3.1 Default Parameter Settings 

Our discussion of randomized one-way communication protocols in Section 2.2 remains
equally relevant for general protocols. Our “default parameter settings” for such protocols 
will be the same. 

Public coins. By default, we work with public-coin protocols, where Alice and Bob
have shared randomness in the form of an infinite sequence of perfectly random bits written
on a blackboard in public view. Such protocols are more powerful than private-coin protocols,
but not by much (Theorem 4.9). Recall that public-coin randomized protocols are equivalent
to distributions over deterministic protocols.

Two-sided error. We allow a protocol to err with constant probability (1/3 by default),
whether the correct answer is “1” or “0.” This is the most permissive error model.

Arbitrary constant error probability. Recall that all constant error probabilities in
(0, 1/2) are the same: changing the error changes the randomized communication complexity
by only a constant factor (by the usual “independent trials” argument, detailed in the
exercises). Thus for upper bounds, we’ll be content to achieve error 49%; for lower bounds,
it is enough to rule out low-communication protocols with error 1%.

Worst-case communication. We define the communication cost of a randomized
protocol as the maximum number of bits ever communicated, over all choices of inputs and 
coin flips. Measuring the expected communication (over the protocol’s coin flips) could 
reduce the communication complexity of a problem, but only by a constant factor. 
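The “independent trials” argument mentioned above can be quantified exactly with a binomial tail. The sketch below is illustrative only: the function names are assumptions, and it models the worst case in which every trial errs independently with probability exactly p.

```python
from math import comb

def majority_error(p, k):
    """Probability that the majority answer of k independent trials is wrong,
    when each trial errs independently with probability p (k should be odd)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def trials_needed(p, target):
    """Smallest odd number of trials whose majority vote has error <= target."""
    k = 1
    while majority_error(p, k) > target:
        k += 2
    return k

k = trials_needed(1 / 3, 0.01)
print(k, majority_error(1 / 3, k))
```

For fixed constants p < 1/2 and target > 0, the number of repetitions returned does not depend on the input length n, which is why the communication cost changes by only a constant factor.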


4.3.2 Newman’s Theorem: Public- vs. Private-Coin Protocols 

We mentioned a few times that, for our purposes, it usually won’t matter whether we 
consider public-coin or private-coin randomized protocols. What we meant is the following 
result. 


Theorem 4.9 (Newman’s Theorem (1991)) If there is a public-coin protocol for a function
f with n-bit inputs that has two-sided error 1/3 and communication cost c, then there
is a private-coin protocol for the problem that has two-sided error 1/3 and communication
cost O(c + log n).


Thus, for problems with public-coin randomized communication complexity Ω(log n),
like most of the problems that we’ll study in this course, there is no difference between the
communication complexity of the public-coin and private-coin variants (modulo constant
factors).

An interesting exception is Equality. Last lecture, we gave a public-coin protocol
(one-way, even) with constant communication complexity. Theorem 4.9 only implies an
upper bound of O(log n) communication for private-coin protocols. (One can also give such
a private-coin protocol directly, see the Exercises.) There is also a matching lower bound
of Ω(log n) for the private-coin communication complexity of Equality. (This isn’t very
hard to prove, but we won’t have an occasion to do it.) Thus public-coin protocols can save
Θ(log n) bits of communication over private-coin protocols, but no more.


Proof of Theorem 4.9: Let P denote a public-coin protocol with two-sided error 1/3. We
begin with a thought experiment. Fix an input (x, y), with x, y ∈ {0,1}^n. If we run
P on this input, a public string r_1 of random bits is consumed and the output of the
protocol is correct with probability at least 2/3. If we run it again, a second (independent)
random string r_2 is consumed and another (independent) answer is given, again correct
with probability at least 2/3. After t such trials and the consumption of random strings
r_1, ..., r_t, P produces t answers. We expect at least 2/3 of these to be correct, and Chernoff
bounds (with δ = Θ(1) and μ = Θ(t)) imply that at least 60% of these answers are correct
with probability at least 1 − exp{−Θ(t)}.

We continue the thought experiment by taking a Union Bound over the 2^n · 2^n = 2^{2n}
choices of the input (x, y). With probability at least 1 − 2^{2n} · exp{−Θ(t)} over the choice of
r_1, ..., r_t, for every input (x, y), running the protocol P with these random strings yields at
least .6t (out of t) correct answers. In this event, the single sequence r_1, ..., r_t of random
strings “works” simultaneously for all inputs (x, y). Provided we take t = αn with a large
enough constant α, this probability is positive. In particular, such a set r_1, ..., r_t of random
strings exists.
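To see how large the constant needs to be, here is one way to make the union bound explicit. This is a sketch under an assumption: it uses Hoeffding's inequality in place of the unspecified Chernoff bound, with Z denoting the number of correct answers among the t independent trials on a fixed input.

```latex
% Assumption: Hoeffding's inequality stands in for the Chernoff bound in the proof.
% Z = number of correct answers among t independent trials,
% each correct with probability at least 2/3, so E[Z] >= 2t/3.
\begin{align*}
\Pr[Z < 0.6\,t]
  &\le \exp\!\left(-2t\left(\tfrac{2}{3}-\tfrac{3}{5}\right)^{2}\right)
   = \exp\!\left(-\tfrac{2t}{225}\right), \\
2^{2n}\cdot\exp\!\left(-\tfrac{2t}{225}\right) < 1
  &\iff t > 225\,n\ln 2 \approx 156\,n .
\end{align*}
```

So under this particular bound, any t larger than about 156n suffices, consistent with t = Θ(n).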

Here is the private-coin protocol. 


(0) Before receiving their inputs, Alice and Bob agree on a set of strings r_1, ..., r_t with
the property that, for every input (x, y), running P t times with the random strings
r_1, ..., r_t yields at least 60% correct answers.

(1) Alice picks an index i ∈ {1, 2, ..., t} uniformly at random and sends it to Bob. This
requires ≈ log_2 t = O(log n) bits of communication (recall t = Θ(n)).

(2) Alice and Bob simulate the public-coin protocol P as if they had public coins given
by r_i.

By the defining property of the r_i’s, this (private-coin) protocol has error at most 40%. As
usual, this can be reduced to 1/3 via a constant number of independent repetitions followed by
taking the majority answer. The resulting communication cost is O(c + log n), as claimed. ■

We stated and proved Theorem 4.9 for general protocols, but the exact same statement
holds (with the same proof) for the one-way protocols that we studied in Lectures 1-3.


4.3.3 Distributional Complexity 

Randomized protocols are significantly harder to reason about than deterministic ones. For 
example, we’ve seen that a deterministic protocol can be thought of as a partition of the input 
space into rectangles. A randomized protocol is a distribution over such partitions. While a 
deterministic protocol that computes a function f induces only monochromatic rectangles,
this does not hold for randomized protocols (which can err with some probability). 

We can make our lives somewhat simpler by using Yao’s Lemma to translate distributional 
lower bounds for deterministic protocols to worst-case lower bounds for randomized protocols. 
Recall the lemma from Lecture 2 (Lemma 2.3).


Lemma 4.10 (Yao 1983) Let D be a distribution over the space of inputs (x, y) to a
communication problem, and ε ∈ (0, 1/2). Suppose that every deterministic protocol P with

    Pr_{(x,y)∼D}[P wrong on (x, y)] ≤ ε

has communication cost at least k. Then every (public-coin) randomized protocol R with
(two-sided) error at most ε on every input has communication cost at least k.


We proved Lemma 2.3 in Lecture 2 for one-way protocols, but the same proof holds
verbatim for general communication protocols. Like in the one-way case, Lemma 2.3 is a
“complete” proof technique: whatever the true randomized communication complexity,
there is a hard distribution D over inputs that can in principle be used to prove it (recall
the Exercises).

Summarizing, proving lower bounds for randomized communication complexity reduces 
to: 


1. Figuring out a “hard distribution” D over inputs. 

2. Proving that every low-communication deterministic protocol has large error w.r.t. 
inputs drawn from D. 

Of course, this is usually easier said than done. 


4.3.4 Case Study: Disjointness 
Overview 

We now return to the DISJOINTNESS problem. In Lecture 2 we proved that the one-way
randomized communication complexity of this problem is linear (Theorem 2.2). We did this
by reducing INDEX to DISJOINTNESS: the former is just a special case of the latter, where
one player has a singleton set (i.e., a standard basis vector). We used Yao’s Lemma (with D
the uniform distribution) and a counting argument (about the volume of small-radius balls
in the Hamming cube, remember?) to prove that the one-way randomized communication
complexity of INDEX is Ω(n). Unfortunately, for general communication protocols, the
communication complexity of INDEX is obviously O(log n): Bob can just send his index
i to Alice using ≈ log_2 n bits, and Alice can compute the function. So, it’s back to the
drawing board.

The following is a major and useful technical achievement. 


Theorem 4.11 (Kalyanasundaram and Schnitger 1992; Razborov 1992) The randomized
communication complexity of DISJOINTNESS is Ω(n).


Theorem 4.11 was originally proved in Kalyanasundaram and Schnitger (1992); the simplified
proof in Razborov (1992) has been more influential. More recently, all the cool kids prove
Theorem 4.11 using “information complexity” arguments; see Bar-Yossef et al. (2002a).


If you only remember one result from the entire field of communication complexity, it
should be Theorem 4.11. The primary reason is that the problem is unreasonably effective for
proving lower bounds for other algorithmic problems; almost every subsequent lecture will
include a significant example. Indeed, many algorithm designers simply use Theorem 4.11
as a “black box” to prove lower bounds for other problems, without losing sleep over its
proof. As a bonus, proofs of Theorem 4.11 tend to showcase techniques that are reusable
in other contexts.

For a trivial consequence of Theorem 4.11 (see future lectures for less obvious ones),
let’s return to the setting of streaming algorithms. Lectures 1 and 2 considered only one-pass
algorithms. In some contexts, like a telescope that generates an exabyte of data per day, this
is a hard constraint. In other settings, like database applications, a small constant number
of passes over the data might be feasible (as an overnight job, for example). Communication
complexity lower bounds for one-way protocols say nothing about two-pass algorithms, while
those for general protocols do. Using Theorem 4.11, all of our Ω(m) space lower bounds for
1-pass algorithms become Ω(m/p) space lower bounds for p-pass algorithms, via the same
reductions. For example, we proved such lower bounds for computing F_∞, the highest
frequency of an element, even with randomization and approximation, and for computing
F_0 or F_2 exactly, even with randomization.

[Footnote: Similar to, for example, the PCP Theorem and the Parallel Repetition Theorem
in the context of hardness of approximation (see e.g. Arora and Lund (1997)).]

[Footnote: There’s no shame in this: life is short and there’s lots of theorems that need
proving.]

[Footnote: A p-pass space-s streaming algorithm S induces a communication protocol with
O(ps) communication, where Alice and Bob turn their inputs into data streams, repeatedly
feed them into S, repeatedly sending the memory state of S back and forth to continue the
simulation.]

So how would one go about proving Theorem 4.11? Recall that Yao’s Lemma reduces
the proof to exhibiting a hard distribution D (a bit of dark art) over inputs and proving
that all low-communication deterministic protocols have large error with respect to D (a
potentially tough math problem). We next discuss each of these steps in turn.


Choosing a Hard Distribution 

The uniform distribution over inputs is not a hard distribution for DISJOINTNESS. What
is the probability that a random input (x, y) satisfies f(x, y) = 1? Independently in each
coordinate i, there is a 25% probability that x_i = y_i = 1. Thus, f(x, y) = 1 with probability
(3/4)^n, which is exponentially small in n. This means that the zero-communication protocol
that always outputs “not disjoint” has low error with respect to this distribution. The moral
is that a hard distribution D must, at the very least, have a constant probability of producing
both “yes” and “no” instances.
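The (3/4)^n calculation is easy to confirm by exhaustive enumeration for small n. This is a minimal sketch; the function name disjoint_fraction is an assumption made for illustration, and exact rational arithmetic is used to avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import product

def disjoint_fraction(n):
    """Exact fraction of input pairs (x, y) in {0,1}^n x {0,1}^n that are disjoint."""
    inputs = list(product((0, 1), repeat=n))
    disjoint = sum(
        all(a * b == 0 for a, b in zip(x, y))  # no coordinate with x_i = y_i = 1
        for x in inputs
        for y in inputs
    )
    return Fraction(disjoint, len(inputs) ** 2)

for n in range(1, 6):
    print(n, disjoint_fraction(n), Fraction(3, 4) ** n)
```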

The next idea, motivated by the Birthday Paradox, is to define D such that each of Alice
and Bob receive a random subset of {1, 2, ..., n} of size ≈ √n. Elementary calculations
show that a random instance (x, y) from D has a constant probability of satisfying each of
f(x, y) = 1 and f(x, y) = 0.

An obvious issue with this approach is that there is a trivial deterministic protocol that
uses O(√n log n) communication and has zero error: Alice (say) just sends her whole input
to Bob by describing each of her √n elements explicitly by name (≈ log_2 n bits each). So
there’s no way to prove a linear communication lower bound using this distribution. Babai
et al. (1986) prove that one can at least prove an Ω(√n) communication lower bound using
this distribution, which is already quite a non-trivial result (more on this below). They also
showed that for every product distribution D (meaning whenever the random choices of
x and of y are independent) there is a deterministic protocol with small error with respect
to D that uses only O(√n log n) bits of communication (see the Exercises).

Summarizing, if we believe that DISJOINTNESS really requires Ω(n) communication to
solve via randomized protocols, then we need to find a distribution D that meets all of the
following criteria.


1. There is a constant probability that f(x, y) = 1 and that f(x, y) = 0. (Otherwise, a
constant protocol works.)

2. Alice and Bob need to usually receive inputs that correspond to sets of size Ω(n).
(Otherwise, one player can explicitly communicate his or her set.)

3. The random inputs x and y are correlated. (Otherwise, the upper bound from Babai
et al. (1986) applies.)

4. It must be mathematically tractable to prove good lower bounds on the error of all
deterministic communication protocols that use a sublinear amount of communication.

[Footnote: This does not imply that a linear lower bound is impossible. The proof of the
converse of Lemma 2.3 (that a tight lower bound on the randomized communication
complexity of a problem can always be proved through a distributional lower bound for a
suitable choice of D) generally makes use of distributions in which the choices of x and y
are correlated.]

















Razborov (1992) proposed a distribution that obviously satisfies the first three properties
and, less obviously, also satisfies the fourth. It is:


1. With probability 75%:

a) (x, y) is chosen uniformly at random subject to:

i. x, y each have exactly n/4 1’s;

ii. there is no index i ∈ {1, 2, ..., n} with x_i = y_i = 1 (so f(x, y) = 1).

2. With probability 25%:

a) (x, y) is chosen uniformly at random subject to:

i. x, y each have exactly n/4 1’s;

ii. there is exactly one index i ∈ {1, 2, ..., n} with x_i = y_i = 1 (so f(x, y) = 0).

Note that in both cases, the constraint on the number of indices i with x_i = y_i = 1 creates
correlation between the choices of x and y.
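Sampling from this distribution is straightforward. The sketch below is one possible sampler, under stated assumptions: n is divisible by 4 with n ≥ 4, inputs are represented as sets of indices in {0, ..., n − 1} rather than bit vectors, and the function name sample_razborov is chosen here for illustration. It uses a uniformly random permutation, so by symmetry each case is uniform over the pairs satisfying its constraints.

```python
import random

def sample_razborov(n, rng=random):
    """Draw (x, y) from the hard distribution; x and y are index sets of size n/4.

    Assumes n is divisible by 4 and n >= 4. rng is the random module or a
    random.Random instance.
    """
    k = n // 4
    indices = list(range(n))
    rng.shuffle(indices)                       # uniformly random permutation
    if rng.random() < 0.75:
        # Case 1: disjoint sets ("yes" instance, f(x, y) = 1).
        x = set(indices[:k])
        y = set(indices[k:2 * k])
    else:
        # Case 2: exactly one common element ("no" instance, f(x, y) = 0).
        shared = indices[0]
        x = {shared, *indices[1:k]}
        y = {shared, *indices[k:2 * k - 1]}
    return x, y

rng = random.Random(0)
x, y = sample_razborov(16, rng)
print(sorted(x), sorted(y), len(x & y))
```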


Proving Error Lower Bounds via Corruption Bounds 


Even if you’re handed a hard distribution over inputs, there remains the challenging task of 
proving a good error lower bound on low-communication deterministic protocols. There are 
multiple methods for doing this, with the corruption method being the most successful one 
so far. We outline this method next. 

At a high level, the corruption method is a natural extension of the covering arguments
of Section 4.2 to protocols that can err. Recall that for deterministic protocols, the covering
approach argues that every covering of the matrix M(f) of the function f by monochromatic
rectangles requires a lot of rectangles. In our examples, we only bothered to argue about
the 1-inputs of the function. We’ll do something similar here, weighted by the distribution
D and allowing errors: arguing that there’s significant mass on the 1-inputs of f, and that
a lot of nearly monochromatic rectangles are required to cover them all.

Precisely, suppose you have a distribution D over the inputs of a problem so that the
“1-mass” of D, meaning Pr_{(x,y)∼D}[f(x, y) = 1], is at least a constant, say .5. The plan is to
prove two properties.


(1) For every deterministic protocol P with error at most a sufficiently small constant ε, at
least 25% of the 1-mass of D is contained in “almost monochromatic 1-rectangles” of P
(defined below). We’ll see below that this is easy to prove in general by an averaging
argument.

(2) An almost monochromatic 1-rectangle contains at most 2^{-c} mass of the distribution
D, where c is as large as possible (ideally c = Ω(n)). This is the hard step, and the
argument will be different for different functions f and different input distributions D.

[Footnote: Since f has only two outputs, it’s almost without loss to pick a single output
z ∈ {0,1} of f and lower bound only the number of monochromatic rectangles needed to
cover all of the z’s.]


If we can establish (1) and (2), then we have a lower bound of Ω(2^c) on the number of
rectangles induced by P, which proves that P uses communication Ω(c).^10


Here’s the formal definition of an almost monochromatic 1-rectangle (AM1R) R = A × B
of a matrix M(f) with respect to an input distribution D:

\[ \Pr_{(x,y) \sim D}[(x,y) \in R \text{ and } f(x,y) = 0] \;\le\; 8\epsilon \cdot \Pr_{(x,y) \sim D}[(x,y) \in R \text{ and } f(x,y) = 1]. \tag{4.2} \]
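
For a small explicit distribution, condition (4.2) can be checked directly. The sketch below is only an illustration: the helper `is_am1r`, the dictionary encoding of D, and the 2-bit Disjointness example (inputs encoded as the integers 0..3, uniform distribution) are all assumptions made here, not part of the notes.

```python
from fractions import Fraction
from itertools import product

def is_am1r(dist, f, A, B, eps):
    """Test inequality (4.2) for the rectangle A x B.

    dist maps each input pair (x, y) to its probability under D.
    """
    in_rect = [((x, y), p) for (x, y), p in dist.items() if x in A and y in B]
    zero_mass = sum(p for (x, y), p in in_rect if f(x, y) == 0)
    one_mass = sum(p for (x, y), p in in_rect if f(x, y) == 1)
    return zero_mass <= 8 * eps * one_mass

def disj(x, y):
    """Disjointness on bitmask-encoded sets: 1 if the sets are disjoint."""
    return 1 if x & y == 0 else 0

# Hypothetical toy distribution: uniform over all pairs of 2-bit inputs.
uniform = {(x, y): Fraction(1, 16) for x, y in product(range(4), repeat=2)}
```

With these toy inputs, a genuinely monochromatic 1-rectangle such as {0, 1} × {0, 2} passes the test for every ε ≥ 0, while {0, 1} × {0, 1} (three 1-inputs, one 0-input) passes only when ε is large enough.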

Here’s why property (1) is true in general. Let P be a deterministic protocol with
error at most ε with respect to D. Since P is deterministic, it partitions the matrix M(f)
into rectangles, and in each rectangle, P’s output is constant. Let R_1, ..., R_ℓ denote the
rectangles in which P outputs “1.”

At least 50% of the 1-mass of D, and hence at least 25% of D’s overall mass, must
be contained in R_1, ..., R_ℓ. For if not, on at least 25% of the mass of D, f(x, y) = 1 while P
outputs “0”. This contradicts the assumption that P has error ε with respect to D (provided
ε < .25).

Also, at least 50% of the mass in R_1, ..., R_ℓ must lie in AM1Rs. For if not, using (4.2)
and the fact that the total mass in R_1, ..., R_ℓ is at least .25, it would follow that D places
more than 8ε · .125 = ε mass on 0-inputs in R_1, ..., R_ℓ. Since P outputs “1” on all of these
inputs, this contradicts the assumption that P has error at most ε with respect to D. This
completes the proof of step (1), which applies to every problem and every distribution D
over inputs with 1-mass at least .5.


Step (2) is difficult and problem-specific. Babai et al. (1986), for their input distribution
D over Disjointness inputs mentioned above, gave a proof of step (2) with c = Ω(√n),
thus giving an Ω(√n) lower bound on the randomized communication complexity of the
problem. Razborov (1992) gave, for his input distribution, a proof of step (2) with c = Ω(n),
implying the desired lower bound for Disjointness. Sadly, we won’t have time to talk
about these and subsequent proofs (as in Bar-Yossef et al. (2002a)); perhaps in a future
course.


^10 Why call it the “corruption method”? Because the argument shows that, if a deterministic protocol has
low communication, then most of its induced rectangles that contain 1-inputs are also “corrupted” by lots of
0-inputs: its rectangles are so big that (4.2) fails. In turn, this implies large error.

Lower Bounds for the Extension Complexity of Polytopes


5.1 Linear Programs, Polytopes, and Extended Formulations 

5.1.1 Linear Programs for Combinatorial Optimization Problems 

You’ve probably seen some polynomial-time algorithms for the problem of computing a
maximum-weight matching of a bipartite graph.^1 Many of these, like the Kuhn-Tucker
algorithm (Kuhn, 1955), are “combinatorial algorithms” that operate directly on the graph.


Linear programming is also an effective tool for solving many discrete optimization problems.
For example, consider the following linear programming relaxation of the maximum-weight
bipartite matching problem (for a weighted bipartite graph G = (U, V, E, w)):

\[ \max \sum_{e \in E} w_e x_e \tag{5.1} \]

subject to

\[ \sum_{e \in \delta(v)} x_e \le 1 \tag{5.2} \]

for every vertex v ∈ U ∪ V (where δ(v) denotes the edges incident to v) and

\[ x_e \ge 0 \tag{5.3} \]

for every edge e ∈ E.

In this formulation, each decision variable x_e is intended to encode whether an edge e is
in the matching (x_e = 1) or not (x_e = 0). It is easy to verify that the vectors of {0,1}^E
that satisfy the constraints (5.2) and (5.3) are precisely the characteristic vectors of the
matchings of G, with the objective function value of the solution to the linear program equal
to the total weight of the matching.
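
The claim about 0-1 vectors is easy to confirm by brute force on a tiny graph. The following sketch is illustrative only: the helper names and the choice of the complete bipartite graph on two plus two vertices are assumptions made here.

```python
from itertools import product

def satisfies_lp(edges, x):
    """Check constraints (5.2) and (5.3) for a vector x indexed like `edges`."""
    if any(xe < 0 for xe in x):                                   # (5.3)
        return False
    vertices = {v for e in edges for v in e}
    return all(sum(xe for e, xe in zip(edges, x) if v in e) <= 1  # (5.2)
               for v in vertices)

def is_matching(edges, x):
    """True if the edges selected by the 0-1 vector x are pairwise disjoint."""
    chosen = [e for e, xe in zip(edges, x) if xe == 1]
    endpoints = [v for e in chosen for v in e]
    return len(endpoints) == len(set(endpoints))

# Hypothetical example graph: the complete bipartite graph K_{2,2}.
edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2")]
all_01 = list(product((0, 1), repeat=len(edges)))
feasible = [x for x in all_01 if satisfies_lp(edges, x)]
matchings = [x for x in all_01 if is_matching(edges, x)]
```

Here `feasible` and `matchings` coincide: the empty matching, the four single edges, and the two perfect matchings.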

Since every characteristic vector of a matching satisfies (5.2) and (5.3), and the set of
feasible solutions to the linear system defined by (5.2) and (5.3) is convex, the convex hull
of the characteristic vectors of matchings is contained in this feasible region.^2 Also note
that every characteristic vector x of a matching is a vertex^3 of this feasible region: since
all feasible solutions have all coordinates bounded by 0 and 1, the 0-1 vector x cannot be
written as a non-trivial convex combination of other feasible solutions. The worry is: does
this feasible region contain anything other than the convex hull of characteristic vectors of
matchings? Equivalently, does it have any vertices that are fractional, and hence do not
correspond to matchings? (Note that integrality is not explicitly enforced by (5.2) or (5.3).)

^1 Recall that a graph is bipartite if its vertex set can be partitioned into two sets U and V such that
every edge has one endpoint in each of U, V. Recall that a matching of a graph is a subset of edges that are
pairwise disjoint.


A nice fact is that the vertices of the feasible region defined by (5.2) and (5.3) are
precisely the characteristic vectors of matchings of G. This is equivalent to the Birkhoff-von
Neumann theorem (see Exercises). There are algorithms that solve linear programs in
polynomial time (and output a vertex of the feasible region, see e.g. Grötschel et al. (1988)),
so this implies that the maximum-weight bipartite matching problem can be solved efficiently
using linear programming.

How about the more general problem of maximum-weight matching in general (non-bipartite)
graphs? While the same linear system (5.2) and (5.3) still contains the convex hull
of all characteristic vectors of matchings, and these characteristic vectors are vertices of the
feasible region, there are also other, fractional, vertices. To see this, consider the simplest
non-bipartite graph, a triangle. Every matching contains at most 1 edge. But assigning
x_e = 1/2 for each of the edges e yields a fractional solution that satisfies (5.2) and (5.3).
This solution clearly cannot be written as a convex combination of characteristic vectors of
matchings.
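
The triangle example can be verified mechanically. In the sketch below (names are illustrative assumptions), the all-halves vector is feasible and its coordinates sum to 3/2, whereas every 0-1 feasible vector, and hence every convex combination of matchings, has coordinate sum at most 1.

```python
from fractions import Fraction
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]        # the triangle on vertices 0, 1, 2

def feasible(x):
    """Constraints (5.2) and (5.3) for the triangle."""
    nonneg = all(xe >= 0 for xe in x)
    degree_ok = all(sum(xe for e, xe in zip(edges, x) if v in e) <= 1
                    for v in range(3))
    return nonneg and degree_ok

half = (Fraction(1, 2),) * 3            # x_e = 1/2 on every edge
# Largest coordinate sum among 0-1 feasible vectors, i.e., among matchings.
best_integral = max(sum(x) for x in product((0, 1), repeat=3) if feasible(x))
```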

It is possible to add to (5.2)-(5.3) additional inequalities (“odd cycle inequalities”
stating that, for every odd cycle C of G, Σ_{e∈C} x_e ≤ (|C| − 1)/2) so that the resulting
smaller set of feasible solutions is precisely the convex hull of the characteristic vectors of
matchings. Unfortunately, many graphs have an exponential number of odd cycles. Is it
possible to add only a polynomial number of inequalities instead? Unfortunately not: the
convex hull of the characteristic vectors of matchings can have 2^{Ω(n)} “facets” (Pulleyblank
and Edmonds, 1974).^4 We define facets more formally in Section 5.3.1 but intuitively they
are the “sides” of a polytope,^5 like the 2n sides of an n-dimensional cube. It is intuitively
clear that a polytope with ℓ facets needs ℓ inequalities to describe: it’s like cleaving a shape
out of marble, with each inequality contributing a single cut. We conclude that there is no
linear program with variables {x_e}_{e∈E} of polynomial size that captures the maximum-weight
(non-bipartite) matching problem.

^2 Recall that a set S ⊆ R^n is convex if it is “filled in,” with λx + (1 − λ)y ∈ S whenever x, y ∈ S and
λ ∈ [0,1]. Recall that the convex hull of a point set P ⊆ R^n is the smallest (i.e., intersection of all) convex
set that contains it. Equivalently, it is the set of all finite convex combinations of points of P, where a
convex combination has the form Σ_{i=1}^{p} λ_i x_i for non-negative λ_i’s summing to 1 and x_1, ..., x_p ∈ P.

^3 There is an unfortunate clash of terminology when talking about linear programming relaxations of
combinatorial optimization problems: a “vertex” might refer to a node of a graph or to a “corner” of a
geometric set.

^4 This linear programming formulation still leads to a polynomial-time algorithm, but using fairly
heavy machinery: the “ellipsoid method” (Khachiyan, 1979) and a “separation oracle” for the odd cycle
inequalities (Padberg and Rao, 1982). There are also polynomial-time combinatorial algorithms for (weighted)
non-bipartite matching, beginning with Edmonds (1965).

^5 A polytope is just a high-dimensional polygon: an intersection of halfspaces that is bounded or,
equivalently, the convex hull of a finite set of points.


5.1.2 Auxiliary Variables and Extended Formulations 


The exponential lower bound above on the number of linear inequalities needed to describe
the convex hull of characteristic vectors of matchings of a non-bipartite graph applies to
linear systems in R^E, with one dimension per edge. The idea of an extended formulation is
to add a polynomial number of auxiliary decision variables, with the hope that radically
fewer inequalities are needed to describe the region of interest in the higher-dimensional
space.

This idea might sound like grasping at straws, but sometimes it actually works. For
example, fix a positive integer n, and represent a permutation π ∈ S_n by the n-vector
x_π = (π(1), π(2), ..., π(n)), with all coordinates in {1, 2, ..., n}. The permutahedron is the
convex hull of all n! such vectors. The permutahedron is known to have 2^{n/2} − 2 facets (see
e.g. Goemans (2014)), so a polynomial-sized linear description would seem out of reach.

Suppose we add n^2 auxiliary variables, y_ij for all i, j ∈ {1, 2, ..., n}. The intent is for
y_ij to be a 0-1 variable that indicates whether or not π(i) = j; in this case, the y_ij’s are
the entries of the n × n permutation matrix that corresponds to π.


We next add a set of constraints to enforce the desired semantics of the y_ij’s (cf. (5.2)
and (5.3)):

\[ \sum_{j=1}^{n} y_{ij} = 1 \tag{5.4} \]

for i = 1, 2, ..., n;

\[ \sum_{i=1}^{n} y_{ij} = 1 \tag{5.5} \]

for j = 1, 2, ..., n; and

\[ y_{ij} \ge 0 \tag{5.6} \]

for all i, j ∈ {1, 2, ..., n}. We also add constraints that enforce consistency between the
permutation encoded by the x_i’s and by the y_ij’s:

\[ x_i = \sum_{j=1}^{n} j \, y_{ij} \tag{5.7} \]

for all i = 1, 2, ..., n.

It is straightforward to check that the vectors y ∈ {0,1}^{n^2} that satisfy (5.4)-(5.6)
are precisely the permutation matrices. For such a y corresponding to a permutation π,
the constraints (5.7) force the x_i’s to encode the same permutation π. Using again the
Birkhoff-von Neumann Theorem, every vector y ∈ R^{n^2} that satisfies (5.4)-(5.6) is a convex
combination of permutation matrices (see Exercises). Constraint (5.7) implies that the x_i’s
encode the same convex combination of permutations. Thus, if we take the set of solutions
in R^{n+n^2} that satisfy (5.4)-(5.7) and project onto the x-coordinates, we get exactly the
permutahedron. This is what we mean by an extended formulation of a polytope.

To recap the remarkable trick we just pulled off: blowing up the number of variables
from n to n + n^2 reduced the number of inequalities needed from 2^{n/2} to n^2 + 3n. This
allows us to optimize a linear function over the permutahedron in polynomial time. Given a
linear function (in the x_i’s), we optimize it over the (polynomial-size) extended formulation,
and retain only the x-variables of the optimal solution.
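
To make the lifting concrete, the sketch below builds the pair (x, y) for a given permutation and checks constraints (5.4)-(5.7). The function names `lift` and `satisfies_extended` are hypothetical, and the code only checks feasibility of given points; it does not solve a linear program.

```python
from fractions import Fraction
from itertools import permutations

def lift(pi):
    """Map a permutation pi (a tuple of the values 1..n) to the pair (x, y)."""
    n = len(pi)
    # y[i][j] = 1 exactly when pi(i+1) = j+1 (Python indices are 0-based).
    y = [[1 if pi[i] == j + 1 else 0 for j in range(n)] for i in range(n)]
    x = [sum((j + 1) * y[i][j] for j in range(n)) for i in range(n)]   # (5.7)
    return x, y

def satisfies_extended(x, y):
    """Check constraints (5.4)-(5.7) for a candidate point (x, y)."""
    n = len(x)
    rows = all(sum(y[i]) == 1 for i in range(n))                        # (5.4)
    cols = all(sum(y[i][j] for i in range(n)) == 1 for j in range(n))   # (5.5)
    nonneg = all(y[i][j] >= 0 for i in range(n) for j in range(n))      # (5.6)
    link = all(x[i] == sum((j + 1) * y[i][j] for j in range(n))
               for i in range(n))                                       # (5.7)
    return rows and cols and nonneg and link

# A fractional point: the average of the lifts of (1, 2, 3) and (2, 1, 3).
h = Fraction(1, 2)
y_avg = [[h, h, 0], [h, h, 0], [0, 0, 1]]
x_avg = [Fraction(3, 2), Fraction(3, 2), 3]
```

Every lifted permutation satisfies the system with x equal to the permutation itself, and the averaged point is feasible too; its x-part is the midpoint of two vertices of the permutahedron.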

Given the utility of polynomial-size extended formulations, we’d obviously like to 
understand which problems have them. For example, does the non-bipartite matching 
problem admit such a formulation? The goal of this lecture is to develop communication 
complexity-based techniques for ruling out such polynomial-size extended formulations. We’ll 
prove an impossibility result for the “correlation polytope” (Fiorini et al. 2015); similar (but 
much more involved) arguments imply that every extended formulation of the non-bipartite 
matching problem requires an exponential number of inequalities (Rothvoß, 2014).


Remark 5.1 (Geometric Intuition) It may seem surprising that adding a relatively 
small number of auxiliary variables can radically reduce the number of inequalities needed to 
describe a set — described in reverse, that projecting onto a subset of variables can massively 
blow up the number of sides. It’s hard to draw (low-dimensional) pictures that illustrate 
this point. If you play around with projections of some three-dimensional polytopes onto 
the plane, you’ll observe that non-facets of the high-dimensional polytope (edges) often 
become facets (again, edges) in the low-dimensional projection. Since the number of
lower-dimensional faces of a polytope can be much bigger than the number of facets (already in
the 3-D cube, there are 12 edges and only 6 sides), it should be plausible that a projection
could significantly increase the number of facets.


5.2 Nondeterministic Communication Complexity 

The connection between extended formulations of polytopes and communication complexity
involves nondeterministic communication complexity. We studied this model implicitly in
parts of Lecture 4; this section makes the model explicit.

Consider a function f : X × Y → {0,1} and the corresponding 0-1 matrix M(f), with
rows indexed by Alice’s possible inputs and columns indexed by Bob’s possible inputs. In
Lecture 4 we proved that if every covering of M(f) by monochromatic rectangles^6 requires
at least t rectangles, then the deterministic communication complexity of f is at least
log_2 t (Theorem 4.3). The reason is that every communication protocol computing f with
communication cost c induces a partition of M(f) into at most 2^c monochromatic rectangles,
and partitions are a special case of coverings. See also Figure 5.1.

^6 Recall that a rectangle is a subset S ⊆ X × Y that has a product structure, meaning S = A × B for
some A ⊆ X and B ⊆ Y. Equivalently, S is closed under “mix and match:” whenever (x_1, y_1) and (x_2, y_2)



Figure 5.1 A covering by four monochromatic rectangles that is not a partition. 


Communication complexity lower bounds that are proved through coverings are actually 
much stronger than we’ve let on thus far — they apply also to nondeterministic protocols, 
which we define next. 

You presumably already have a feel for nondeterminism from your study of the complexity
class NP. Recall that one way to define NP is as the problems for which membership
can be verified in polynomial time. To see how an analog might work with communication
protocols, consider the complement of the Equality problem, ¬Equality. If a third
party wanted to convince Alice and Bob that their inputs x and y are different, it would
not be difficult: just specify an index i ∈ {1, 2, ..., n} for which x_i ≠ y_i. Specifying an
index requires log_2 n bits, and specifying whether x_i = 0 and y_i = 1 or x_i = 1 and
y_i = 0 requires one additional bit. Given such a specification, Alice and Bob can check the
correctness of this “proof of non-equality” without any communication. If x ≠ y, there is
always a (log_2 n + 1)-bit proof that will convince Alice and Bob of this fact; if x = y, then
no such proof will convince Alice and Bob otherwise. This means that ¬Equality has
nondeterministic communication complexity at most log_2 n + 1.
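
The proof of non-equality can be sketched in a few lines. The function names below are hypothetical; a proof is modeled as a pair (i, b) asserting that x_i = b and y_i = 1 − b, and each party checks only its own half of the claim.

```python
def prover(x, y):
    """Return a proof (i, b) that x != y, or None if x == y."""
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi != yi:
            return i, xi      # b = x_i; the claim is x_i = b and y_i = 1 - b
    return None

def alice_accepts(x, proof):
    """Alice looks only at her own input x."""
    i, b = proof
    return x[i] == b

def bob_accepts(y, proof):
    """Bob looks only at his own input y."""
    i, b = proof
    return y[i] == 1 - b
```

When x = y, every candidate proof (i, b) is rejected by at least one party, because x_i = b and y_i = 1 − b cannot both hold.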

Coverings of M(f) by monochromatic rectangles are closely related to the nondeterministic
communication complexity of f. We first show how coverings lead to nondeterministic
protocols. It’s easiest to formally define such protocols after the proof.

Proposition 5.2 Let f : X × Y → {0,1} be a Boolean function and M(f) the corresponding
matrix. If there is a cover of the 1-entries of M(f) by t 1-rectangles, then there is a
nondeterministic protocol that verifies f(x, y) = 1 with cost log_2 t.


are in S, so are (x_1, y_2) and (x_2, y_1). A rectangle is monochromatic (w.r.t. f) if it contains only 1-entries of
M(f) or only 0-entries of M(f). In these cases, we call it a 1-rectangle or a 0-rectangle, respectively.

Proof: Let R_1, ..., R_t denote a covering of the 1s of M(f) by 1-rectangles. Alice and Bob
can agree to this covering in advance of receiving their inputs. Now consider the following
scenario:

1. A prover (a third party) sees both inputs x and y. (This is the formal model used
for nondeterministic protocols.)

2. The prover writes an index i ∈ {1, 2, ..., t}, the name of a rectangle R_i, on a
blackboard, in public view. Since R_i is a rectangle, it can be written as R_i = A_i × B_i
with A_i ⊆ X, B_i ⊆ Y.

3. Alice accepts if and only if x ∈ A_i.

4. Bob accepts if and only if y ∈ B_i.

This protocol has the following properties:

1. If f(x, y) = 1, then there exists a proof such that Alice and Bob both accept. (Since
f(x, y) = 1, (x, y) ∈ R_i for some i, and Alice and Bob both accept if “i” is written on
the blackboard.)

2. If f(x, y) = 0, there is no proof that both Alice and Bob accept. (Whatever index
i ∈ {1, 2, ..., t} is written on the blackboard, since f(x, y) = 0, either x ∉ A_i or
y ∉ B_i, causing a rejection.)

3. The maximum length of a proof is log_2 t. (A proof is just an index i ∈ {1, 2, ..., t}.)

These three properties imply, by definition, that the nondeterministic communication
complexity of the function f and the output 1 is at most log_2 t. ■
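
The protocol in this proof can be simulated for any explicit cover. The sketch below assumes a cover is given as a list of pairs (A_i, B_i) of Python sets; the helper `accepted_proofs` and the 2-bit "x ≠ y" example are illustrative assumptions, not part of the notes.

```python
from itertools import product

def accepted_proofs(cover, x, y):
    """Indices i such that Alice (x in A_i) and Bob (y in B_i) both accept."""
    return [i for i, (A, B) in enumerate(cover) if x in A and y in B]

# Hypothetical example: f(x, y) = 1 iff x != y, on 2-bit inputs.
# Rectangle (i, b) has A = {x : x_i = b} and B = {y : y_i = 1 - b}.
inputs = list(product((0, 1), repeat=2))
cover = [({x for x in inputs if x[i] == b},
          {y for y in inputs if y[i] == 1 - b})
         for i in range(2) for b in (0, 1)]
```

With this cover of t = 4 rectangles, some proof is accepted by both parties exactly on the inputs with x ≠ y, matching properties 1 and 2 above.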


The proof of Proposition 5.2 introduces our formal model of nondeterministic communication
complexity: Alice and Bob are given a “proof” or “advice string” by a prover, which
can depend on both of their inputs; the communication cost is the worst-case length of the
proof; and a protocol is said to compute an output z ∈ {0,1} of a function f if f(x, y) = z
if and only if there exists a proof such that both Alice and Bob accept.

With nondeterministic communication complexity, we speak about both a function f and
an output z ∈ {0,1}. For example, if f is Equality, then we saw that the nondeterministic
communication complexity of f and the output 0 is at most log_2 n + 1. Since it’s not clear
how to convince Alice and Bob that their inputs are equal without specifying at least one
bit for each of the n coordinates, one might expect the nondeterministic communication
complexity of f and the output 1 to be roughly n. (And it is, as we’ll see.)

We’ve defined nondeterministic protocols so that Alice and Bob never speak, and only
verify. This is without loss of generality, since given a protocol in which they do speak, one
could modify it so that the prover writes on the blackboard everything that they would have
said. We encourage the reader to formalize an alternative definition of nondeterministic
protocols without a prover and in which Alice and Bob speak nondeterministically, and to
prove that this definition is equivalent to the one we’ve given above (see Exercises).

Next we prove the converse of Proposition 5.2.

Proposition 5.3 If the nondeterministic communication complexity of the function f and
the output 1 is c, then there is a covering of the 1s of M(f) by 2^c 1-rectangles.

Proof: Let P denote a nondeterministic communication protocol for f and the output 1 with
communication cost (i.e., maximum proof length) at most c. For a proof ℓ, let Z(ℓ) denote
the inputs (x, y) where both Alice and Bob accept the proof. We can write Z(ℓ) = A × B,
where A is the set of inputs x ∈ X of Alice where she accepts the proof ℓ, and B is the
set of inputs y ∈ Y of Bob where he accepts the proof. By the assumed correctness of P,
f(x, y) = 1 for every (x, y) ∈ Z(ℓ). That is, Z(ℓ) is a 1-rectangle.

By the first property of nondeterministic protocols, for every 1-input (x, y) there is a
proof such that both Alice and Bob accept. That is, ∪_ℓ Z(ℓ) is precisely the set of 1-inputs
of f: a covering of the 1s of M(f) by 1-rectangles. Since the communication cost of P is
at most c, there are at most 2^c different proofs ℓ. ■
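
The construction of the rectangles Z(ℓ) can also be carried out explicitly for a toy protocol. In the sketch below, `induced_rectangles` is a hypothetical helper and the protocol is the one for "x ≠ y" on 2-bit inputs, with proofs of the form (i, b); none of these names come from the notes.

```python
def induced_rectangles(proofs, X, Y, alice_accepts, bob_accepts):
    """Map each proof l to Z(l) = A x B, returned as the pair of sets (A, B)."""
    return {l: ({x for x in X if alice_accepts(x, l)},
                {y for y in Y if bob_accepts(y, l)})
            for l in proofs}

# Hypothetical protocol for "x != y" on 2-bit inputs; a proof is a pair (i, b).
X = Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
proofs = [(i, b) for i in range(2) for b in (0, 1)]
Z = induced_rectangles(proofs, X, Y,
                       lambda x, l: x[l[0]] == l[1],
                       lambda y, l: y[l[0]] == 1 - l[1])
# Union of all the rectangles Z(l), as a set of input pairs.
covered = {(x, y) for A, B in Z.values() for x in A for y in B}
```

The union of the four induced rectangles is exactly the set of 1-inputs, as the proof predicts.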

Proposition 5.3 implies that communication complexity lower bounds derived from
covering lower bounds apply to nondeterministic protocols.

Corollary 5.4 If every covering of the 1s of M(f) by 1-rectangles uses at least t rectangles,
then the nondeterministic communication complexity of f is at least log_2 t.

Thus our arguments in Lecture 4, while simple, were even more powerful than we realized:
they prove that the nondeterministic communication complexity of Equality, Disjointness,
and Greater-Than (all with output 1) is at least n. It’s kind of amazing that these
lower bounds can be proved with so little work.

5.3 Extended Formulations and Nondeterministic Communication Complexity 

What does communication complexity have to do with extended formulations? To forge a 
connection, we need to show that an extended formulation with few inequalities is somehow 
useful for solving hard communication problems. While this course includes a number of 
clever connections between communication complexity and various computational models, 
this connection to extended formulations is perhaps the most surprising and ingenious one 
of them all. Superficially, extended formulations with few inequalities can be thought of 
as “compressed descriptions” of a polytope, and communication complexity is generally 
useful for ruling out compressed descriptions of various types. It is not at all obvious that 
this vague intuition can be turned into a formal connection, let alone one that is useful for 
proving non-trivial impossibility results. 





5.3.1 Faces and Facets 

We discuss briefly some preliminaries about polytopes. Let P be a polytope in variables
x ∈ R^n. By definition, an extended formulation of P is a set of the form

\[ Q = \{ (x, y) : Cx + Dy \le d \}, \]

where x and y are the original and auxiliary variables, respectively, such that

\[ \{ x : \exists y \text{ s.t. } (x, y) \in Q \} = P. \]

That is, projecting Q onto the original variables x yields the original polytope P. The
extended formulation of the permutahedron described in Section 5.1.2 is a canonical example.
The size of an extended formulation is the number of inequalities.^7

Recall that x ∈ P is a vertex if it cannot be written as a non-trivial convex combination
of other points in P. A supporting hyperplane of P is a vector a ∈ R^n and scalar b ∈ R such
that ax ≤ b for all x ∈ P. Every supporting hyperplane a, b induces a face of P, defined as
{x ∈ P : ax = b}: the intersection of the boundaries of P and of the halfspace defined
by the supporting hyperplane. (See Figure 5.2.) Note that a face is generally induced by
many different supporting hyperplanes. The empty set is considered a face. Note also that
faces are nested: in three dimensions, there are vertices, edges, and sides. In general, if f
is a face of P, then the vertices of f are precisely the vertices of P that are contained in f.



Figure 5.2 A supporting hyperplane of a polytope P and the corresponding face of the polytope. 

^7 There is no need to keep track of the number of auxiliary variables: there is no point in having an
extended formulation of this type with more variables than inequalities (see Exercises).

A facet of P is a maximal face: a face that is not strictly contained in any other face.
Provided P has a non-empty interior, its facets are (n − 1)-dimensional.

There are two different types of finite descriptions of a polytope, and it is useful to go
back and forth between them. First, a polytope P equals the convex hull of its vertices.
Second, P is the intersection of the halfspaces that define its facets.^8


5.3.2 Yannakakis’s Lemma 

What good is a small extended formulation? We next make up a contrived communication
problem for which small extended formulations are useful. For a polytope P, in the
corresponding Face-Vertex(P) problem, Alice gets a face f of P (in the form of a
supporting hyperplane a, b) and Bob gets a vertex v of P. The function FV(f, v) is defined
as 1 if v does not belong to f, and 0 if v ∈ f. Equivalently, FV(f, v) = 1 if and only if
a^T v < b, where a, b is a supporting hyperplane that induces f. Polytopes in n dimensions
generally have an exponential number of faces and vertices. Thus, trivial protocols for
Face-Vertex(P), where one party reports their input to the other, can have communication
cost Ω(n).

A key result is the following. 


Lemma 5.5 (Yannakakis’s Lemma (1991)) If the polytope P admits an extended formulation
Q with r inequalities, then the nondeterministic communication complexity of
Face-Vertex(P) is at most log_2 r.

That is, if we can prove a linear lower bound on the nondeterministic communication 
complexity of the Face-Vertex(P) problem, then we have ruled out subexponential-size 
extended formulations of P. 

Sections 5.3.3 and 5.3.4 give two different proof sketches of Lemma 5.5. These are
roughly equivalent, with the first emphasizing the geometric aspects (following Lovász
(1990)) and the second the algebraic aspects (following Yannakakis (1991)). In Section 5.4
we put Lemma 5.5 to use and prove strong lower bounds for a concrete polytope.

Remarkably, Yannakakis (1991) did not give any applications of his lemma: the lower
bounds for extended formulations in Yannakakis (1991) are for “symmetric” formulations
and proved via direct arguments. Lemma 5.5 was suggested by Yannakakis (1991) as a
potentially useful tool for more general impossibility results, and finally in the past five
years (beginning with Fiorini et al. (2015)) this prophecy has come to pass.


5.3.3 Proof Sketch of Lemma 5.5: A Geometric Argument


Suppose P admits an extended formulation Q = {(x, y) : Cx + Dy ≤ d} with only r
inequalities. Both P and Q are known to Alice and Bob before the protocol begins. A first
idea is for Alice, who is given a face f of the original polytope P, to tell Bob the name of
the “corresponding face” of Q. Bob can then check whether or not his “corresponding vertex”
belongs to the named face, thereby computing the function.

Unfortunately, knowing that Q is defined by r inequalities only implies that it has at
most r facets; it can have a very large number of faces. Thus Alice can no more afford to
write down an arbitrary face of Q than a face of P.

We use a third-party prover to name a suitable facet of Q that enables Alice and Bob
to compute the Face-Vertex(P) function; since Q has at most r facets, the protocol’s
communication cost is only log_2 r, as desired.

^8 Proofs of all of these statements are elementary but outside the scope of this lecture; see e.g. Ziegler
(1995) for details.

Suppose the prover wants to convince Alice and Bob that Bob’s vertex v of P does not
belong to Alice’s face f of P. If the prover can name a facet f* of Q such that:

(i) there exists y_v with (v, y_v) ∈ Q such that (v, y_v) ∉ f*; and

(ii) for every (x, y) ∈ Q with x ∈ f, (x, y) ∈ f*;

then this facet f* proves that v ∉ f. Moreover, given f*, Alice and Bob can verify (ii)
and (i), respectively, without any communication.

All that remains to prove is that, when v ∉ f, there exists a facet f* of Q such that (i)
and (ii) hold. First consider the inverse image of f in Q, f̃ = {(x, y) ∈ Q : x ∈ f}.
Similarly, define ṽ = {(v, y) ∈ Q}. Since v ∉ f, f̃ and ṽ are disjoint subsets of Q. It is not
difficult to prove that f̃ and ṽ, as inverse images of faces under a linear map, are faces of Q
(exercise). An intuitive but non-trivial fact is that every face of a polytope is the intersection
of the facets that contain it.^9 Thus, for every vertex v* of Q that is contained in ṽ (and
hence not in f̃), and since ṽ is non-empty there is at least one, we can choose a facet
f* of Q that contains f̃ (property (ii)) but excludes v* (property (i)). This concludes the
proof sketch of Lemma 5.5.


5.3.4 Proof Sketch of Lemma 5.5: An Algebraic Argument


The next proof sketch of Lemma 5.5 is a bit longer but introduces some of the most important 
concepts in the study of extended formulations. 

The slack matrix of a polytope P has rows indexed by faces F and columns indexed by
vertices V. We identify each face with a canonical supporting hyperplane a, b. Entry S_{fv} of
the slack matrix is defined as b − a^T v, where a, b is the supporting hyperplane corresponding
to the face f. Observe that all entries of S are nonnegative. Define the support supp(S)
of the slack matrix S as the F × V matrix with 1-entries wherever S has positive entries,
and 0-entries wherever S has 0-entries. Observe that supp(S) is a property only of the
polytope P, independent of the choices of the supporting hyperplanes for the faces of P.
Observe also that supp(S) is precisely the answer matrix for the Face-Vertex(P) problem
for the polytope P.

^9 This follows from Farkas’s Lemma, or equivalently the Separating Hyperplane Theorem. See Ziegler
(1995) for details.
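
As a small illustration of the slack matrix definition, the sketch below computes slack entries for the unit square. It is a simplification made for this example: only the four facets are used as rows (the full slack matrix has a row for every face), and the helper names are hypothetical.

```python
def slack_matrix(hyperplanes, vertices):
    """Entry [f][v] is b - a^T v for the supporting hyperplane (a, b) of face f."""
    return [[b - sum(ai * vi for ai, vi in zip(a, v)) for v in vertices]
            for (a, b) in hyperplanes]

def support(S):
    """Replace positive entries by 1 and zero entries by 0."""
    return [[1 if entry > 0 else 0 for entry in row] for row in S]

# Unit square [0, 1]^2; each facet is given as (a, b), meaning a^T x <= b.
vertices = [(0, 0), (0, 1), (1, 0), (1, 1)]
facets = [((1, 0), 1),     # x1 <= 1
          ((-1, 0), 0),    # -x1 <= 0
          ((0, 1), 1),     # x2 <= 1
          ((0, -1), 0)]    # -x2 <= 0
S = slack_matrix(facets, vertices)
```

Each entry of `support(S)` is 1 exactly when the vertex lies off the facet, which is the Face-Vertex answer for that pair.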

We next identify a sufficient condition for Face-Vertex(P) to have low nondeterministic
communication complexity; later we explain why the existence of a small extended
formulation implies this sufficient condition. Suppose the slack matrix S has nonnegative
rank r, meaning it is possible to write S = TU with T a |F| × r nonnegative matrix and U
a r × |V| nonnegative matrix (Figure 5.3).^10 Equivalently, suppose we can write S as the
sum of r outer products of nonnegative vectors (indexed by F and V):

\[ S = \sum_{j=1}^{r} \alpha_j \cdot \beta_j^T, \tag{5.8} \]

where the α_j’s correspond to the columns of T and the β_j’s to the rows of U.



Figure 5.3 A rank-r factorization of the slack matrix S into nonnegative matrices T and U.


We claim that if the slack matrix S of a polytope P has nonnegative rank r, then there is
a nondeterministic communication protocol for Face-Vertex(P) with cost at most log_2 r.
As usual, Alice and Bob can agree to the decomposition (5.8) in advance. A key observation
is that, by inspection of (5.8), S_{fv} > 0 if and only if there exists some j ∈ {1, 2, ..., r} with
α_{fj} β_{jv} > 0. (We are using here that everything is nonnegative and so no cancellations
are possible.) Equivalently, the supports of the outer products α_j · β_j^T can be viewed as a
covering of the 1-entries of supp(S) by r 1-rectangles. Given this observation, the protocol
for Face-Vertex(P) should be clear.

1. The prover announces an index j ∈ {1, 2, ..., r}.

2. Alice accepts if and only if the f-th component of α_j is strictly positive.

3. Bob accepts if and only if the v-th component of β_j is strictly positive.

The communication cost of the protocol is clearly log_2 r. The key observation above implies
that there is a proof (i.e., an index j ∈ {1, 2, ..., r}) accepted by both Alice and Bob if and
only if Bob’s vertex v does not belong to Alice’s face f.

^10 This is called a nonnegative matrix factorization. It is the analog of the singular value decomposition
(SVD), but with the extra constraint that the factors are nonnegative matrices. It obviously only makes
sense to ask for such decompositions for nonnegative matrices (like S).
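
The three-step protocol can be simulated for any explicit nonnegative factorization. In the sketch below, the matrices T and U are made-up nonnegative factors with r = 2, so S = TU only plays the role of a slack matrix and is not claimed to come from an actual polytope; the helper names are likewise hypothetical.

```python
def accepted_indices(T, U, f, v):
    """Indices j for which Alice (T[f][j] > 0) and Bob (U[j][v] > 0) both accept."""
    return [j for j in range(len(U)) if T[f][j] > 0 and U[j][v] > 0]

def matmul(T, U):
    """Plain matrix product of T (|F| x r) and U (r x |V|)."""
    r = len(U)
    return [[sum(T[f][j] * U[j][v] for j in range(r)) for v in range(len(U[0]))]
            for f in range(len(T))]

# Hypothetical nonnegative factors with r = 2.
T = [[1, 0], [0, 2], [1, 1]]
U = [[1, 0, 3], [0, 1, 0]]
S = matmul(T, U)
```

Because all entries are nonnegative, an index j is accepted by both parties exactly when the entry S[f][v] is positive, which is the "no cancellations" observation in executable form.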

It remains to prove that, whenever a polytope P admits an extended formulation with
a small number of inequalities, its slack matrix admits a low-rank nonnegative matrix
factorization.^11 We’ll show this by exhibiting nonnegative r-vectors λ_f (for all faces f of P)
and μ_v (for all vertices v of P) such that S_{fv} = λ_f^T μ_v for all f and v. In terms of Figure 5.3,
the λ_f’s and μ_v’s correspond to the rows of T and columns of U, respectively.

The next task is to understand better how an extended formulation $Q = \{(x, y) : Cx + Dy \le d\}$
must be related to the original polytope $P$. Given that projecting $Q$ onto the
variables $x$ yields $P$, it must be that every supporting hyperplane of $P$ is logically implied
by the inequalities that define $Q$. To see one way how this can happen, suppose that, for an
inequality $a^T x \le b$, there is a non-negative $r$-vector $\lambda \in \mathbb{R}^r_+$ with the following properties:

(P1) $\lambda^T C = a^T$;

(P2) $\lambda^T D = 0$;

(P3) $\lambda^T d = b$.

(P1)-(P3) imply that, for every $(x, y)$ in $Q$ (and so with $Cx + Dy \le d$), we have

$$\underbrace{\lambda^T C}_{= a^T} x + \underbrace{\lambda^T D}_{= 0} y \le \underbrace{\lambda^T d}_{= b}$$

and hence $a^T x \le b$ (no matter what $y$ is).

Nonnegative linear combinations $\lambda$ of the constraints of $Q$ that satisfy (P1)-(P3) are one
way in which the constraints of $Q$ imply constraints on the values of $x$ in the projection of
$Q$. A straightforward application of Farkas’s Lemma (see e.g. Chvátal (1983)) implies that
such nonnegative linear combinations are the only way in which the constraints of $Q$ imply
constraints on the projection of $Q$.^12 Put differently, whenever $a^T x \le b$ is a supporting
hyperplane of $P$, there exists a nonnegative linear combination $\lambda$ that proves it (i.e., that
satisfies (P1)-(P3)). This clarifies what the extended formulation $Q$ really accomplishes:
ranging over all $\lambda \in \mathbb{R}^r_+$ satisfying (P2) generates all of the supporting hyperplanes $(a, b)$ of $P$
(with $a^T$ and $b$ arising as $\lambda^T C$ and $\lambda^T d$, respectively).


^11 The converse also holds, and might well be the easier direction to anticipate. See the Exercises for
details.

^12 Farkas’s Lemma is sometimes phrased as the Separating Hyperplane Theorem. It can also be thought
of as the feasibility version of strong linear programming duality.

To define the promised $\lambda_f$’s and $\mu_v$’s, fix a face $f$ of $P$ with supporting hyperplane
$a^T x \le b$. Since $Q$’s projection does not include any points not in $P$, the constraints of $Q$
imply this supporting hyperplane. By the previous paragraph, we can choose a nonnegative
vector $\lambda_f$ so that (P1)-(P3) hold.

Now fix a vertex $v$ of $P$. Since $Q$’s projection includes every point of $P$, there exists a
choice of $y_v$ such that $(v, y_v) \in Q$. Define $\mu_v \in \mathbb{R}^r_+$ as the slack in $Q$’s constraints at the
point $(v, y_v)$:

$$\mu_v = d - Cv - Dy_v.$$

Since $(v, y_v) \in Q$, $\mu_v$ is a nonnegative vector.

Finally, for every face $f$ of $P$ and vertex $v$ of $P$, we have

$$\lambda_f^T \mu_v = \underbrace{\lambda_f^T d}_{= b} - \underbrace{\lambda_f^T C}_{= a^T} v - \underbrace{\lambda_f^T D}_{= 0} y_v = b - a^T v = S_{fv},$$

as desired. This completes the second proof of Lemma 5.5.
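To make the construction concrete, the following Python sketch works through a tiny example. Everything in it is a hypothetical toy instance chosen for illustration (it is not from these notes): $P$ is the interval $[0, 1]$, and $Q = \{(x, y) : -x \le 0, \ -y \le 0, \ x + y \le 1\}$ is an extended formulation of it with $r = 3$ inequalities. The multipliers `lam` and the witnesses $y_v$ were picked by hand.

```python
# Hypothetical toy instance: P = [0, 1] and
# Q = {(x, y) : -x <= 0, -y <= 0, x + y <= 1}, whose projection onto x is P.
C = [-1, 0, 1]   # coefficients of x in the three constraints of Q
D = [0, -1, 1]   # coefficients of y
d = [0, 0, 1]    # right-hand sides

# Faces of P as supporting hyperplanes a*x <= b, each with a hand-picked
# nonnegative multiplier vector lambda_f.
faces = {
    "x >= 0": {"a": -1, "b": 0, "lam": [1, 0, 0]},
    "x <= 1": {"a": 1, "b": 1, "lam": [0, 1, 1]},
}

# Vertices v of P, each with a witness y_v such that (v, y_v) lies in Q.
vertices = {0: 0, 1: 0}


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def satisfies_p1_p3(face):
    """Check nonnegativity of lambda together with (P1), (P2), (P3)."""
    lam = face["lam"]
    return (all(l >= 0 for l in lam)
            and dot(lam, C) == face["a"]
            and dot(lam, D) == 0
            and dot(lam, d) == face["b"])


def mu(v, y):
    """Slack of Q's constraints at the point (v, y): d - C v - D y."""
    return [d[i] - C[i] * v - D[i] * y for i in range(len(d))]


def slack_entry(face, v):
    """Slack b - a*v of vertex v in the face's supporting hyperplane."""
    return face["b"] - face["a"] * v
```

For every face and vertex of this instance, `dot(face["lam"], mu(v, y_v))` should equal `slack_entry(face, v)`, which is exactly the identity $\lambda_f^T \mu_v = S_{fv}$.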


5.4 A Lower Bound for the Correlation Polytope 


5.4.1 Overview 


Lemma 5.5 reduces the task of proving lower bounds on the size of extended formulations of 
a polytope P to proving lower bounds on the nondeterministic communication complexity 
of Face-Vertex(P). The case study of the permutahedron (Section 5.1.2) serves as a
cautionary tale here: the communication complexity of Face-Vertex(P) is surprisingly 
low for some complex-seeming polytopes, so proving strong lower bounds, when they exist, 
typically requires work and a detailed understanding of the particular polytope of interest. 


Fiorini et al. (2015) were the first to use Yannakakis’s Lemma to prove lower bounds 
on the size of extended formulations of interesting polytopes.^13 We follow the proof plan
of Fiorini et al. (2015), which has two steps. 


1. First, we exhibit a polytope that is tailor-made for proving a nondeterministic
communication complexity lower bound on the corresponding Face-Vertex(P) problem,
via a reduction from Disjointness. We’ll prove this step in full.

2. Second, we extend the consequent lower bound on the size of extended formulations to
other problems, such as the Traveling Salesman Problem (TSP), via reductions. These
reductions are bread-and-butter NP-completeness-style reductions; see the Exercises
for more details.

^13 This paper won the Best Paper Award at STOC ’12.
This two-step plan does not seem sufficient to resolve the motivating problem mentioned
in Section 5.1, the non-bipartite matching problem. For an NP-hard problem like TSP,
we fully expect all extended formulations of the convex hull of the characteristic vectors
of solutions to be exponential; otherwise, we could use linear programming to obtain a
subexponential-time algorithm for the problem (an unlikely result). The non-bipartite
matching problem is polynomial-time solvable, so it’s less clear what to expect. Rothvoß
(2014) proved that every extended formulation of the convex hull of the perfect matchings
of the complete graph has exponential size.^14 The techniques in Rothvoß (2014) are
sophisticated variations of the tools covered in this lecture; a reader of these notes is
well-positioned to learn them.


5.4.2 Preliminaries 


We describe a polytope $P$ for which it’s relatively easy to prove nondeterministic
communication complexity lower bounds for the corresponding Face-Vertex($P$) problem. The
polytope was studied earlier for other reasons (Pitowsky, 1991; de Wolf, 2003).

Given a 0-1 $n$-bit vector $x$, we consider the corresponding (symmetric and rank-1) outer
product $xx^T$. For example, if $x = 10101$, then

$$xx^T = \begin{pmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 \end{pmatrix}.$$

For a positive integer $n$, we define COR as the convex hull of all $2^n$ such vectors $xx^T$ (ranging
over $x \in \{0,1\}^n$). This is a polytope in $\mathbb{R}^{n^2}$, and its vertices are precisely the points $xx^T$
with $x \in \{0,1\}^n$.

Our goal is to prove the following result. 


Theorem 5.6 (Fiorini et al. 2015) The nondeterministic communication complexity of
Face-Vertex(COR) is $\Omega(n)$.


This lower bound is clearly the best possible (up to a constant factor), since Bob can
communicate his vertex to Alice using only $n$ bits (by specifying the appropriate $x \in \{0,1\}^n$).

Lemma 5.5 then implies that every extended formulation of the COR polytope requires
$2^{\Omega(n)}$ inequalities, no matter how many auxiliary variables are added. Note the dimension $d$
is $\Theta(n^2)$, so this lower bound has the form $2^{\Omega(\sqrt{d})}$.

Elementary reductions (see the Exercises) translate this extension complexity lower
bound for the COR polytope to a lower bound of $2^{\Omega(\sqrt{n})}$ on the size of extended formulations
of the convex hull of characteristic vectors of $n$-point traveling salesman tours.


^14 This paper won the Best Paper Award at STOC ’14.
























5.4.3 Some Faces of the Correlation Polytope 

Next we establish a key connection between certain faces of the correlation polytope and 
inputs to Disjointness. Throughout, n is a fixed positive integer. 


Lemma 5.7 (Fiorini et al. 2015) For every subset $S \subseteq \{1, 2, \ldots, n\}$, there is a face $f_S$ of
COR such that: for every $R \subseteq \{1, 2, \ldots, n\}$ with characteristic vector $x_R$ and corresponding
vertex $v_R = x_R x_R^T$ of COR,

$$v_R \in f_S \text{ if and only if } |S \cap R| = 1.$$


That is, among the faces of COR are $2^n$ faces that encode the “unique intersection property”
for each of the $2^n$ subsets $S$ of $\{1, 2, \ldots, n\}$. Note that for a given $S$, the sets $R$ with $|S \cap R| = 1$
can be generated by (i) first picking an element of $S$; (ii) picking a subset of $\{1, 2, \ldots, n\} \setminus S$.
Thus if $|S| = k$, there are $k 2^{n-k}$ sets $R$ with which it has a unique intersection.


Lemma 5.7 is kind of amazing, but also not too hard to prove. 


Proof of Lemma 5.7: For every $S \subseteq \{1, 2, \ldots, n\}$, we need to exhibit a supporting hyperplane
$a^T y \le b$ such that $a^T v_R = b$ if and only if $|S \cap R| = 1$, where $v_R$ denotes $x_R x_R^T$ and $x_R$
the characteristic vector of $R \subseteq \{1, 2, \ldots, n\}$.

Fix $S \subseteq \{1, 2, \ldots, n\}$. We develop the appropriate supporting hyperplane, in variables
$y \in \mathbb{R}^{n^2}$, over several small steps.


1. For clarity, let’s start in the wrong space, with variables $z \in \mathbb{R}^n$ rather than $y \in \mathbb{R}^{n^2}$.
Here $z$ is meant to encode the characteristic vector of a set $R \subseteq \{1, 2, \ldots, n\}$. One
sensible inequality to start with is

$$\sum_{i \in S} z_i - 1 \ge 0. \qquad (5.9)$$

For example, if $S = \{1, 3\}$, then this constraint reads $z_1 + z_3 - 1 \ge 0$.

The good news is that for 0-1 vectors $x_R$, this inequality is satisfied with equality if
and only if $|S \cap R| = 1$. The bad news is that it does not correspond to a supporting
hyperplane: if $S$ and $R$ are disjoint, then $x_R$ violates the inequality. How can we
change the constraint so that it holds with equality for $x_R$ with $|S \cap R| = 1$ and is also
valid for all $R$?


2. One crazy idea is to square the left-hand side of (5.9):

$$\left( \sum_{i \in S} z_i - 1 \right)^2 \ge 0. \qquad (5.10)$$

For example, if $S = \{1, 3\}$, then the constraint reads (after expanding)
$z_1^2 + z_3^2 + 2z_1z_3 - 2z_1 - 2z_3 + 1 \ge 0$.

The good news is that every 0-1 vector $x_R$ satisfies this inequality, and equality holds
if and only if $|S \cap R| = 1$. The bad news is that the constraint is non-linear and hence
does not correspond to a supporting hyperplane.


3. The obvious next idea is to “linearize” the previous constraint. Wherever the constraint
has a $z_i^2$ or a $z_i$, we replace it by a variable $y_{ii}$ (note these partially cancel out).
Wherever the constraint has a $2z_iz_j$ (and notice for $i \ne j$ these always come in pairs),
we replace it by a $y_{ij} + y_{ji}$. Formally, the constraint now reads

$$-\sum_{i \in S} y_{ii} + \sum_{i \ne j \in S} y_{ij} + 1 \ge 0, \qquad (5.11)$$

where the second sum ranges over ordered pairs $(i, j)$.

Note that the new variable set is $y \in \mathbb{R}^{n^2}$. For example, if $S = \{1, 3\}$, then the new
constraint reads $y_{13} + y_{31} - y_{11} - y_{33} \ge -1$.

A first observation is that, for $y$’s that are 0-1, symmetric, and rank-1, with $y = zz^T$
(hence $y_{ij} = z_i \cdot z_j$ for $i, j \in \{1, 2, \ldots, n\}$), the left-hand sides of (5.10) and (5.11) are
the same by definition (using $z_i^2 = z_i$ for 0-1 values). Thus, for $y = x_R x_R^T$ with $x_R \in \{0,1\}^n$,
$y$ satisfies the (linear) inequality (5.11), and equality holds if and only if $|S \cap R| = 1$.


We have shown that, for every $S \subseteq \{1, 2, \ldots, n\}$, the linear inequality (5.11) is satisfied
by every vector $y \in \mathbb{R}^{n^2}$ of the form $y = x_R x_R^T$ with $x_R \in \{0,1\}^n$. Since COR is by definition
the convex hull of such vectors, every point of COR satisfies (5.11). This inequality is
therefore a supporting hyperplane, and the face it induces contains precisely those vertices
of the form $x_R x_R^T$ with $|S \cap R| = 1$. This completes the proof. ■
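The conclusion of the proof is easy to check by brute force for small $n$. The Python sketch below is an illustration only (the function names are made up, and sets are 0-indexed). It evaluates the left-hand side of (5.11) at every vertex $x_R x_R^T$ and confirms that it is nonnegative, with equality exactly when $|S \cap R| = 1$.

```python
from itertools import product


def vertex(x):
    """The vertex x x^T of COR, stored as an n-by-n list of lists."""
    return [[xi * xj for xj in x] for xi in x]


def lhs_5_11(S, y):
    """Left-hand side of (5.11): -sum_{i in S} y_ii + sum_{i != j in S} y_ij + 1."""
    diagonal = sum(y[i][i] for i in S)
    off_diagonal = sum(y[i][j] for i in S for j in S if i != j)
    return -diagonal + off_diagonal + 1


def check_lemma_5_7(n):
    """Check validity and tightness of (5.11) over all subsets S and all vertices."""
    for s_bits in product([0, 1], repeat=n):
        S = [i for i in range(n) if s_bits[i]]
        for x in product([0, 1], repeat=n):
            value = lhs_5_11(S, vertex(x))
            intersection = sum(x[i] for i in S)
            if value < 0 or (value == 0) != (intersection == 1):
                return False
    return True
```

On a vertex with $|S \cap R| = m$ the left-hand side evaluates to $(m - 1)^2$, which is why the check succeeds.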


5.4.4 Face-Vertex(COR) and Unique-Disjointness 


In the Face-Vertex(COR) problem, Alice receives a face $f$ of COR and Bob a vertex $v$ of
COR. In the 1-inputs, $v \notin f$; in the 0-inputs, $v \in f$. Let’s make the problem only easier by
restricting Alice’s possible inputs to the $2^n$ faces (one per subset $S \subseteq \{1, 2, \ldots, n\}$) identified
in Lemma 5.7. In the corresponding matrix $M_U$ of this function, we can index the rows by
subsets $S$. Since every vertex of COR has the form $y = x_R x_R^T$ for $R \subseteq \{1, 2, \ldots, n\}$, we can
index the columns of $M_U$ by subsets $R$. By Lemma 5.7, the entry $(S, R)$ of the matrix $M_U$
is 1 if $|S \cap R| \ne 1$ and 0 if $|S \cap R| = 1$. That is, the 0-entries of $M_U$ correspond to pairs
$(S, R)$ that intersect in a unique element.

There is clearly a strong connection between the matrix $M_U$ above and the analogous
matrix $M_D$ for Disjointness. They differ on entries $(S, R)$ with $|S \cap R| \ge 2$: these are
0-entries of $M_D$ but 1-entries of $M_U$. In other words, $M_U$ is the matrix corresponding to
the communication problem ¬Unique-Intersection: do the inputs $S$ and $R$ fail to have
a unique intersection?

The closely related Unique-Disjointness problem is a “promise” version of Disjointness.
The task here is to distinguish between:

(1) inputs $(S, R)$ of Disjointness with $|S \cap R| = 0$;

(0) inputs $(S, R)$ of Disjointness with $|S \cap R| = 1$.

For inputs that fall into neither case (with $|S \cap R| > 1$), the protocol is off the hook: any
output is considered correct. Since a protocol that solves Unique-Disjointness has to do
only less than one that solves ¬Unique-Intersection, communication complexity lower
bounds for the former problem apply immediately to the latter.
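The relationship between the two matrices can be seen directly for small $n$. In the Python sketch below (illustrative names; subsets of $\{1, \ldots, n\}$ are encoded as $n$-bit integers), `m_disjointness` plays the role of $M_D$ and `m_not_unique` the role of $M_U$. The two should differ exactly on the pairs with $|S \cap R| \ge 2$, which are the inputs on which Unique-Disjointness makes no promise.

```python
def intersection_size(S, R):
    """|S ∩ R| for subsets encoded as bitmasks."""
    return bin(S & R).count("1")


def m_disjointness(n):
    """Entry (S, R) is 1 if and only if S and R are disjoint."""
    size = 1 << n
    return [[1 if intersection_size(S, R) == 0 else 0 for R in range(size)]
            for S in range(size)]


def m_not_unique(n):
    """Entry (S, R) is 1 if and only if |S ∩ R| != 1."""
    size = 1 << n
    return [[1 if intersection_size(S, R) != 1 else 0 for R in range(size)]
            for S in range(size)]


def differing_entries(n):
    """All pairs (S, R) on which the two matrices disagree."""
    MD, MU = m_disjointness(n), m_not_unique(n)
    size = 1 << n
    return [(S, R) for S in range(size) for R in range(size) if MD[S][R] != MU[S][R]]
```

For $n = 3$, for example, there are $4^3 - 3^3 - 3 \cdot 3^2 = 10$ such pairs.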

We summarize the discussion of this section in the following proposition. 


Proposition 5.8 (Fiorini et al. 2015) The nondeterministic communication complexity 
of Face-Vertex(COR) is at least that of Unique-Disjointness. 


5.4.5 A Lower Bound for Unique-Disjointness 

The Goal 

One final step remains in our proof of Theorem 5.6, and hence of our lower bound on the
size of extended formulations of the correlation polytope.


Theorem 5.9 (Fiorini et al. 2015; Kaibel and Weltge 2015) The nondeterministic
communication complexity of Unique-Disjointness is $\Omega(n)$.


Disjointness Revisited 


As a warm-up, we revisit the standard Disjointness problem. Recall that, in Lecture 4,
we proved that the nondeterministic communication complexity of Disjointness is at least
$n$ by a fooling set argument (Corollary 4.8). Next we prove a slightly weaker lower bound,
via an argument that generalizes to Unique-Disjointness.

The first claim is that, of the $2^n \times 2^n = 4^n$ possible inputs of Disjointness, exactly $3^n$ of
them are 1-inputs. The reason is that the following procedure, which makes $n$ 3-way choices,
generates every 1-input exactly once: independently for each coordinate $i = 1, 2, \ldots, n$,
choose between the options (i) $x_i = y_i = 0$; (ii) $x_i = 1$ and $y_i = 0$; and (iii) $x_i = 0$ and
$y_i = 1$.

The second claim is that every 1-rectangle (every subset $A$ of rows of $M_D$ and $B$ of
columns of $M_D$ such that $A \times B$ contains only 1-inputs) has size at most $2^n$. To prove
this, let $R = A \times B$ be a 1-rectangle. We assert that, for every coordinate $i = 1, 2, \ldots, n$,
either (i) $x_i = 0$ for all $x \in A$ or (ii) $y_i = 0$ for all $y \in B$. That is, every coordinate has,
for at least one of the two parties, a “forced zero” in $R$. For if neither (i) nor (ii) holds for
a coordinate $i$, then since $R$ is a rectangle (and hence closed under “mix and match”) we
can choose $(x, y) \in R$ with $x_i = y_i = 1$; but this is a 0-input and $R$ is a 1-rectangle. This
assertion implies that the following procedure, which makes $n$ 2-way choices, generates
every 1-input of $R$ (and possibly other inputs as well): independently for each coordinate
$i = 1, 2, \ldots, n$, set the forced zero ($x_i = 0$ in case (i) or $y_i = 0$ in case (ii)) and choose a bit
for this coordinate in the other input.

These two claims imply that every covering of the 1-inputs by 1-rectangles requires
at least $(3/2)^n$ rectangles. Proposition 5.3 then implies a lower bound of $\Omega(n)$ on the
nondeterministic communication complexity of Disjointness.
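To spell out the arithmetic in this last step (a sketch that assumes, as used later in this section, that a nondeterministic protocol with cost $c$ induces a covering of the 1-inputs by at most $2^c$ 1-rectangles): the $3^n$ 1-inputs must be covered by at most $2^c$ rectangles containing at most $2^n$ inputs each, so

```latex
2^c \cdot 2^n \;\ge\; 3^n
\quad\Longrightarrow\quad
c \;\ge\; n \log_2 \tfrac{3}{2} \;\approx\; 0.585\, n .
```

This is weaker than the bound of $n$ from the fooling set argument by only a constant factor.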


Proof of Theorem 5.9

Recall that the 1-inputs $(x, y)$ of Unique-Disjointness are the same as those of
Disjointness (for each $i$, either $x_i = 0$, $y_i = 0$, or both). Thus, there are still exactly $3^n$
1-inputs. The 0-inputs $(x, y)$ of Unique-Disjointness are those with $x_i = y_i = 1$ in
exactly one coordinate $i$. We call all other inputs, where the promise fails to hold, *-inputs.
By a 1-rectangle, we now mean a rectangle with no 0-inputs (*-inputs are fine). With
this revised definition, it is again true that every nondeterministic communication protocol
that solves Unique-Disjointness using $c$ bits of communication induces a covering of the
1-inputs by at most $2^c$ 1-rectangles.


Lemma 5.10 Every 1-rectangle of Unique-Disjointness contains at most $2^n$ 1-inputs.


As with the argument for Disjointness, Lemma 5.10 completes the proof of Theorem 5.9:
since there are $3^n$ 1-inputs and at most $2^n$ per 1-rectangle, every covering by 1-rectangles
requires at least $(3/2)^n$ rectangles. This implies that the nondeterministic communication
complexity of Unique-Disjointness is $\Omega(n)$.
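Lemma 5.10 can be checked exhaustively for very small $n$. The Python sketch below is a brute-force illustration, not part of the proof, and its names are made up. Inputs are encoded as $n$-bit integers. For each nonempty set $A$ of rows it pairs $A$ with the largest set $B$ of columns that creates no 0-input, which is the best choice because adding a compatible column can only add 1-inputs.

```python
def intersection_size(x, y):
    """Number of coordinates in which both n-bit inputs have a 1."""
    return bin(x & y).count("1")


def max_one_inputs_in_one_rectangle(n):
    """Maximum number of 1-inputs in a rectangle A x B with no 0-inputs.

    Runs over all 2^(2^n) - 1 nonempty row sets, so only tiny n are feasible.
    """
    size = 1 << n
    best = 0
    for mask in range(1, 1 << size):
        A = [x for x in range(size) if (mask >> x) & 1]
        # Columns that form no 0-input (unique intersection) with any row of A.
        B = [y for y in range(size)
             if all(intersection_size(x, y) != 1 for x in A)]
        ones = sum(1 for x in A for y in B if (x & y) == 0)
        best = max(best, ones)
    return best
```

The rectangle with $A = \{0^n\}$ and $B = \{0,1\}^n$ already contains $2^n$ 1-inputs, so for each small $n$ the function should return exactly $2^n$.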

Why is the proof of Lemma 5.10 harder than in Section 5.4.5? We can no longer easily
argue that, in a rectangle $R = A \times B$, for each coordinate $i$, either $x_i = 0$ for all $x \in A$
or $y_i = 0$ for all $y \in B$. Assuming the opposite no longer yields a contradiction: exhibiting
$x \in A$ and $y \in B$ with $x_i = y_i = 1$ does not necessarily contradict the fact that $R$ is a
1-rectangle, since $(x, y)$ might be a *-input.


Proof of Lemma 5.10: The proof is one of those slick inductions that you can’t help but sit
back and admire.

We claim, by induction on $k = 0, 1, 2, \ldots, n$, that if $R = A \times B$ is a 1-rectangle for which
all $x \in A$ and $y \in B$ have 0s in their last $n - k$ coordinates, then the number of 1-inputs in
$R$ is at most $2^k$. The lemma is equivalent to the case of $k = n$. The base case $k = 0$ holds,
because in this case the only possible input in $R$ is $(0, 0)$.

For the inductive step, fix a 1-rectangle $R = A \times B$ in which the last $n - k$ coordinates
of all $x \in A$ and all $y \in B$ are 0. To simplify notation, from here on we ignore the last $n - k$
coordinates of all inputs (they play no role in the argument).

Intuitively, we need to somehow “zero out” the $k$th coordinate of all inputs in $R$ so that
we can apply the inductive hypothesis. This motivates focusing on the $k$th coordinate, and
we’ll often write inputs $x \in A$ and $y \in B$ as $x'a$ and $y'b$, respectively, with $x', y' \in \{0,1\}^{k-1}$
and $a, b \in \{0,1\}$. (Recall we’re ignoring the last $n - k$ coordinates, which are now always
zero.)

First observe that, whenever $(x'a, y'b)$ is a 1-input, we cannot have $a = b = 1$. Also:

(*) If $(x'a, y'b) \in R$ is a 1-input, then $R$ cannot contain both the inputs $(x'0, y'1)$ and
$(x'1, y'0)$.

For otherwise, $R$ would also contain the 0-input $(x'1, y'1)$, contradicting that $R$ is a
1-rectangle. (Since $(x'a, y'b)$ is a 1-input, the unique coordinate of $(x'1, y'1)$ with a 1 in both
inputs is the $k$th coordinate.)

The plan for the rest of the proof is to define two sets $S_1, S_2$ of 1-inputs (not necessarily
rectangles) such that:

(P1) the number of 1-inputs in $S_1$ and $S_2$ combined is at least that in $R$;

(P2) the inductive hypothesis applies to rect($S_1$) and rect($S_2$), where rect($S$) denotes the
smallest rectangle containing a set $S$ of inputs.^15

If we can find sets $S_1, S_2$ with properties (P1),(P2), then we are done: by the inductive
hypothesis, the rect($S_i$)’s have at most $2^{k-1}$ 1-inputs each, the $S_i$’s are only smaller, and
hence (by (P1)) $R$ has at most $2^k$ 1-inputs, as required.

We define the sets in two steps, focusing first on property (P1). Recall that every 1-input
$(x, y) \in R$ has the form $(x'1, y'0)$, $(x'0, y'1)$, or $(x'0, y'0)$. We put all 1-inputs of the first
type into a set $S'_1$, and all 1-inputs of the second type into a set $S'_2$. When placing inputs of
the third type, we want to avoid putting two inputs of the form $(x'a, y'b)$ with the same
$x'$ and $y'$ into the same set (this would create problems in the inductive step). So, for an
input $(x'0, y'0) \in R$, we put it in $S'_1$ if and only if the input $(x'1, y'0)$ was not already put
in $S'_1$, and we put it in $S'_2$ if and only if the input $(x'0, y'1)$ was not already put in $S'_2$.
Crucially, observation (*) implies that $R$ cannot contain two 1-inputs of the form $(x'1, y'0)$
and $(x'0, y'1)$, so the 1-input $(x'0, y'0)$ is placed in at least one of the sets $S'_1, S'_2$. (It is
placed in both if $R$ contains neither $(x'1, y'0)$ nor $(x'0, y'1)$.) By construction, the sets $S'_1$
and $S'_2$ satisfy property (P1).

We next make several observations about $S'_1$ and $S'_2$. By construction:

(**) for each $i = 1, 2$ and $x', y' \in \{0,1\}^{k-1}$, there is at most one input of $S'_i$ of the form
$(x'a, y'b)$.

Also, since $S'_1, S'_2$ are subsets of the rectangle $R$, rect($S'_1$), rect($S'_2$) are also subsets of $R$.
Since $R$ is a 1-rectangle, so are rect($S'_1$), rect($S'_2$). Also, since every input $(x, y)$ of $S'_i$ (and
hence rect($S'_i$)) has $y_k = 0$ (for $i = 1$) or $x_k = 0$ (for $i = 2$), the $k$th coordinate contributes
nothing to the intersection of any inputs of rect($S'_1$) or rect($S'_2$).

^15 Equivalently, the closure of $S$ under the “mix and match” operation on pairs of inputs. Formally,
rect($S$) $= X(S) \times Y(S)$, where $X(S) = \{x : (x, y) \in S \text{ for some } y\}$ and
$Y(S) = \{y : (x, y) \in S \text{ for some } x\}$.

Now obtain $S_i$ from $S'_i$ (for $i = 1, 2$) by zeroing out the $k$th coordinate of all inputs.
Since the $S'_i$’s only contain 1-inputs, the $S_i$’s only contain 1-inputs. Since property (**)
implies that $|S_i| = |S'_i|$ for $i = 1, 2$, we conclude that property (P1) holds also for $S_1, S_2$.

Moving on to property (P2), since rect($S'_1$), rect($S'_2$) contain no 0-inputs and contain only
inputs with no intersection in the $k$th coordinate, rect($S_1$), rect($S_2$) contain no 0-inputs.^16
Finally, since all inputs of $S_1, S_2$ have zeroes in their final $n - k + 1$ coordinates, so do all
inputs of rect($S_1$), rect($S_2$). The inductive hypothesis applies to rect($S_1$) and rect($S_2$), so
each of them has at most $2^{k-1}$ 1-inputs. This implies the inductive step and completes the
proof. ■


^16 The concern is that zeroing out an input in the $k$th coordinate turns some *-input (with intersection
size 2) into a 0-input (with intersection size 1); but since there were no intersections in the $k$th coordinate,
anyways, this can’t happen.




Lecture 6 


Lower Bounds for Data Structures 


6.1 Preamble 


Next we discuss how to use communication complexity to prove lower bounds on the 
performance — meaning space, query time, and approximation — of data structures. Our 
case study will be the high-dimensional approximate nearest neighbor problem. 

There is a large literature on data structure lower bounds. There are several different 
ways to use communication complexity to prove such lower bounds, and we’ll unfortunately 
only have time to discuss one of them. For example, we discuss only a static data structure
problem (where the data structure can only be queried, not modified), and lower bounds
for dynamic data structures tend to use somewhat different techniques. See Miltersen (1999)
and Pătrașcu (2008) for some starting points for further reading.


We focus on the approximate nearest neighbor problem for a few reasons: it is obviously
a fundamental problem that gets solved all the time (in data mining, for example); there
are some non-trivial upper bounds; for certain parameter ranges, we have matching lower
bounds; and the techniques used to prove these lower bounds are representative of work
in the area, namely asymmetric communication complexity and reductions from the “Lopsided
Disjointness” problem.


6.2 The Approximate Nearest Neighbor Problem 

In the nearest neighbor problem, the input is a set $S$ of $n$ points that lie in a metric space
$(X, \ell)$. Most commonly, the metric space is Euclidean space ($\mathbb{R}^d$ with the $\ell_2$ norm). In these
lectures, we’ll focus on the Hamming cube, where $X = \{0,1\}^d$ and $\ell$ is Hamming distance.
On the upper bound side, the high-level ideas (related to “locality sensitive hashing (LSH)”)
we use are also relevant for Euclidean space and other natural metric spaces; we’ll get
a glimpse of this at the very end of the lecture. On the lower bound side, you won’t be
surprised to hear that the Hamming cube is the easiest metric space to connect directly
to the standard communication complexity model. Throughout the lecture, you’ll want to
think of $d$ as pretty big, say $d = \sqrt{n}$.

Returning to the general nearest neighbor problem, the goal is to build a data structure
$D$ (as a function of the point set $S \subseteq X$) to prepare for all possible nearest neighbor queries.
Such a query is a point $q \in X$, and the responsibility of the algorithm is to use $D$ to return
the point $p^* \in S$ that minimizes $\ell(p, q)$ over $p \in S$. One extreme solution is to build no
data structure at all, and given $q$ to compute $p^*$ by brute force. Assuming that computing
$\ell(p, q)$ takes $O(d)$ time, this query algorithm runs in time $O(dn)$. The other extreme is
to pre-compute the answer to every possible query $q$, and store the results in a look-up
table. For the Hamming cube, this solution uses $O(d2^d)$ space, and the query time is that
of one look-up in this table. The exact nearest neighbor problem is believed to suffer from
the “curse of dimensionality,” meaning that a non-trivial query time (sublinear in $n$, say)
requires a data structure with space exponential in $d$.

There have been lots of exciting positive results for the $(1 + \epsilon)$-approximate version of
the nearest neighbor problem, where the query algorithm is only required to return a point
$p$ with $\ell(p, q) \le (1 + \epsilon)\ell(p^*, q)$, where $p^*$ is the (exact) nearest neighbor of $q$. This is the
problem we discuss in these lectures. You’ll want to think of $\epsilon$ as a not-too-small constant,
perhaps 1 or 2. For many of the motivating applications of the nearest-neighbor problem
(for example, the problem of detecting near-duplicate documents, e.g., to filter search results),
such approximate solutions are still practically relevant.
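The brute-force extreme and the $(1 + \epsilon)$-approximation guarantee are simple enough to state in code. The following Python sketch is purely illustrative (the function names are made up); points of the Hamming cube are represented as tuples of bits.

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbor(points, q):
    """Exact nearest neighbor by brute force: O(dn) time, no data structure."""
    return min(points, key=lambda p: hamming(p, q))


def is_approx_nearest(points, q, p, eps):
    """Does p satisfy l(p, q) <= (1 + eps) * l(p*, q) for the exact answer p*?"""
    best = hamming(nearest_neighbor(points, q), q)
    return hamming(p, q) <= (1 + eps) * best
```

The data structures discussed below aim to return some point passing the `is_approx_nearest` test while spending far less time per query than `nearest_neighbor` does.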


6.3 An Upper Bound: Biased Random Inner Products 

In this section we present a non-trivial data structure for the $(1 + \epsilon)$-approximate nearest
neighbor problem in the Hamming cube. The rough idea is to hash the Hamming cube and
then precompute the answer for each of the hash table’s buckets; the trick is to make
sure that nearby points are likely to hash to the same bucket. Section 6.4 proves a sense
in which this data structure is the best possible: no data structure for the problem with
equally fast query time uses significantly less space.


6.3.1 The Key Idea (via a Public-Coin Protocol) 

For the time being, we restrict attention to the decision version of the $(1 + \epsilon)$-nearest neighbor
problem. Here, the data structure construction depends both on the point set $S \subseteq \{0,1\}^d$
and on a given parameter $L \in \{0, 1, 2, \ldots, d\}$. Given a query $q$, the algorithm only has to
decide correctly between the following cases:


1. There exists a point $p \in S$ with $\ell(p, q) \le L$.

2. $\ell(p, q) \ge (1 + \epsilon)L$ for every point $p \in S$.


If neither of these two cases applies, then the algorithm is off the hook (either answer is
regarded as correct). We’ll see in Section 6.3.3 how, using this solution as an ingredient, we
can build a data structure for the original version of the nearest neighbor problem.

Recall that upper bounds on communication complexity are always suspect: by design,
the computational model is extremely powerful so that lower bounds are as impressive as
possible. There are cases, however, where designing a good communication protocol reveals
the key idea for a solution that is realizable in a reasonable computational model. Next is
the biggest example of this that we’ll see in the course.

In the special case where $S$ contains only one point, the decision version of the $(1 + \epsilon)$-approximate
nearest neighbor problem resembles two other problems that we’ve studied
before in other contexts, one easy and one hard.


1. Equality. Recall that when Alice and Bob just want to decide whether their inputs
are the same or different (equivalently, deciding between Hamming distance 0
and Hamming distance at least 1), there is an unreasonably effective (public-coin)
randomized communication protocol for the problem. Alice and Bob interpret the
first $2n$ public coins as two random $n$-bit strings $r_1, r_2$. Alice sends the inner product
modulo 2 of her input $x$ with $r_1$ and $r_2$ (2 bits) to Bob. Bob accepts if and only
if the two inner products modulo 2 of his input $y$ with $r_1, r_2$ match those of Alice.
This protocol never rejects inputs with $x = y$, and accepts inputs with $x \ne y$ with
probability 1/4.


2. Gap-Hamming. Recall in this problem Alice and Bob want to decide between the
cases where the Hamming distance between their inputs is at most $\frac{n}{2} - \sqrt{n}$, or at
least $\frac{n}{2} + \sqrt{n}$. In Theorem 2.5 of Section 2.7 we proved that this problem is hard
for one-way communication protocols (via a clever reduction from Index); it is also
hard for general communication protocols (Chakrabarti and Regev, 2012; Sherstov,
2012; Vidick, 2011). Note however that the gap between the two cases is very small,
corresponding to $\epsilon = \Theta(1/\sqrt{n})$. In the decision version of the $(1 + \epsilon)$-approximate nearest
neighbor problem, we’re assuming a constant-factor gap in Hamming distance between
the two cases, so there’s hope that the problem is easier.
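The Equality protocol recalled in item 1 is short enough to simulate exactly. The Python sketch below is an illustration with made-up function names. Rather than sampling, it enumerates every pair of public strings $(r_1, r_2)$ and reports the exact acceptance probability as a fraction.

```python
from fractions import Fraction
from itertools import product


def inner_mod2(x, r):
    """Inner product of two bit tuples, modulo 2."""
    return sum(a & b for a, b in zip(x, r)) % 2


def equality_accepts(x, y, r1, r2):
    """Bob accepts iff both of his inner products match Alice's two bits."""
    alice_message = (inner_mod2(x, r1), inner_mod2(x, r2))
    return alice_message == (inner_mod2(y, r1), inner_mod2(y, r2))


def acceptance_probability(x, y):
    """Exact acceptance probability over uniformly random r1, r2."""
    strings = list(product([0, 1], repeat=len(x)))
    accepted = sum(equality_accepts(x, y, r1, r2)
                   for r1 in strings for r2 in strings)
    return Fraction(accepted, len(strings) ** 2)
```

Equal inputs are accepted with probability 1, while any pair of distinct inputs is accepted with probability exactly 1/4, since each inner product independently detects the difference with probability 1/2.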


Consider now the communication problem where Alice and Bob want to decide if the
Hamming distance $\ell_H(x, y)$ between their inputs $x, y \in \{0,1\}^d$ is at most $L$ or at least
$(1 + \epsilon)L$. We call this the $\epsilon$-Gap Hamming problem. We analyze the following protocol;
we’ll see shortly how to go from this protocol to a data structure.

1. Alice and Bob use the public random coins to choose $s$ random strings $R = \{r_1, \ldots, r_s\} \subseteq
\{0,1\}^d$, where $s = \Theta(\epsilon^{-2})$. The strings are not uniformly random: each entry is chosen
independently, with probability $1/2L$ of being equal to 1.

2. Alice sends the $s$ inner products (modulo 2) $\langle x, r_1 \rangle, \ldots, \langle x, r_s \rangle$ of her input and the
random strings to Bob; together these form a “hash value” $h_R(x) \in \{0,1\}^s$.

3. Bob accepts if and only if the corresponding hash value $h_R(y)$ of his input (the $s$ inner
products (modulo 2) of $y$ with the random strings in $R$) differs from $h_R(x)$ in only a
“small” (TBD) number of coordinates.
Intuitively, the goal is to modify our randomized communication protocol for Equality so
that it continues to accept strings that are close to being equal. A natural way to do this is to
bias the coefficient vectors significantly more toward 0 than before. For example, if $x, y$ differ
in only a single bit, then choosing $r$ uniformly at random results in $\langle x, r \rangle \not\equiv \langle y, r \rangle \bmod 2$
with probability 1/2 (if and only if $r$ has a 1 in the coordinate where $x, y$ differ). With
probability $1/2L$ of a 1 in each coordinate of $r$, the probability that $\langle x, r \rangle \not\equiv \langle y, r \rangle \bmod 2$ is
only $1/2L$. Unlike our Equality protocol, this protocol for $\epsilon$-Gap Hamming has two-sided
error.

For the analysis, it’s useful to think of each random choice of a coordinate $r_{ji}$ as occurring
in two stages: in the first stage, the coordinate is deemed relevant with probability $1/L$ and
irrelevant otherwise. In stage 2, $r_{ji}$ is set to 0 if the coordinate is irrelevant, and chosen
uniformly from $\{0,1\}$ if the coordinate is relevant. We can therefore think of the protocol as:
(i) first choosing a subset of relevant coordinates; (ii) running the old Equality protocol
on these coordinates only. With this perspective, we see that if $\ell_H(x, y) = \Delta$, then

$$\Pr_{r_j}\left[ \langle r_j, x \rangle \not\equiv \langle r_j, y \rangle \bmod 2 \right] = \frac{1}{2} \left( 1 - \left( 1 - \frac{1}{L} \right)^{\Delta} \right) \qquad (6.1)$$

for every $r_j \in R$. In (6.1), the quantity inside the outer parentheses is exactly the probability
that at least one of the $\Delta$ coordinates on which $x, y$ differ is deemed relevant. This is a
necessary condition for the event $\langle r_j, x \rangle \not\equiv \langle r_j, y \rangle \bmod 2$ and, in this case, the conditional
probability of this event is exactly $\frac{1}{2}$ (as in the old Equality protocol).

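The probability in (6.1) is an increasing function of $\Delta$, as one would expect. Let $t$ denote
the probability in (6.1) when $\Delta = L$. We’re interested in how much bigger this probability
is when $\Delta$ is at least $(1 + \epsilon)L$. We can take the difference between these two cases and
bound it below using the fact that $1 - x \in [e^{-2x}, e^{-x}]$ for $x \in [0, 1/2]$ (applied with $x = 1/L$,
so we assume $L \ge 2$):

$$\frac{1}{2} \underbrace{\left( 1 - \frac{1}{L} \right)^{L}}_{\ge e^{-2}} \Bigl( 1 - \underbrace{\left( 1 - \frac{1}{L} \right)^{\epsilon L}}_{\le e^{-\epsilon}} \Bigr) \ge \frac{1}{2e^2} \left( 1 - e^{-\epsilon} \right) := h(\epsilon).$$

Note that $h(\epsilon)$ is a constant, depending on $\epsilon$ only. Thus, with $s$ random strings, if $\ell_H(x, y) \le L$
then we expect at most $ts$ of the random inner products (modulo 2) to be different; if $\ell_H(x, y) \ge
(1 + \epsilon)L$, then we expect at least $(t + h(\epsilon))s$ of them to be different. A routine application
of Chernoff bounds implies the following.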


Corollary 6.1 Define the hash function hn as in the communication protocol above. If 
s = H(^ log j), then with probability at least 1—5 over the choice of R: 

(i) //£/f(x,y) <L, then £H{h{-x),h{y)) < {t+\h{e)))s. 
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(a) //£H(x,y) > (l + e)L, then £H{h{-x),h{y)) > {t+lh{e))s. 

We complete the communication protocol above by defining “small” in the third step as
$(t + \frac{1}{2}h(\epsilon))s$. We conclude that the $\epsilon$-Gap Hamming problem can be solved by a public-coin
randomized protocol with two-sided error and communication cost $O(\epsilon^{-2})$.
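The protocol can be sketched in a few lines. The code below is a minimal illustration, not the notes’ own implementation: it samples the shared random strings, computes the sketches, and applies the threshold $(t + \frac{1}{2}h(\epsilon))s$. It assumes $h(\epsilon) = (1 - e^{-\epsilon})/(2e^2)$, which is what one gets by bounding $(1 - 1/L)^L \ge e^{-2}$; all function names are hypothetical.

```python
import math
import random

def h(eps):
    """Assumed gap constant h(eps) = (1 - e^{-eps}) / (2 e^2)."""
    return (1.0 - math.exp(-eps)) / (2.0 * math.e ** 2)

def sample_strings(d, L, s, rng):
    """s random d-bit strings; each bit is 1 independently with probability 1/(2L)."""
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(d)] for _ in range(s)]

def sketch(R, x):
    """h_R(x): the j-th bit is the inner product <r_j, x> mod 2."""
    return [sum(a & b for a, b in zip(r, x)) % 2 for r in R]

def looks_close(R, x, y, L, eps):
    """Declare 'distance at most L' iff the sketches differ in at most
    (t + h(eps)/2) * s coordinates, where t is (6.1) evaluated at Delta = L."""
    s = len(R)
    t = 0.5 * (1.0 - (1.0 - 1.0 / L) ** L)
    hamming = sum(a != b for a, b in zip(sketch(R, x), sketch(R, y)))
    return hamming <= (t + 0.5 * h(eps)) * s
```

In the communication setting Alice would send `sketch(R, x)` to Bob, who compares it with `sketch(R, y)`; only the $s$ sketch bits cross the channel, since $R$ comes from the public coins.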


6.3.2 The Data Structure (Decision Version) 

We now show how to translate the communication protocol above into a data structure
for the $(1+\epsilon)$-nearest neighbor problem. For the moment, we continue to restrict to the
decision version of the problem, for an a priori known value of $L \in \{0,1,2,\ldots,d\}$. All we
do is precompute the answers for all possible hash values of a query (an “inverted index”).

Given a point set $P$ of $n$ points in $\{0,1\}^d$, we choose a set $R$ of $s = \Theta(\epsilon^{-2}\log n)$ random
strings $r_1,\ldots,r_s$ according to the distribution of the previous section (with a “1” chosen with
probability $1/2L$). We again define the hash function $h_R : \{0,1\}^d \to \{0,1\}^s$ by setting the
$j$th coordinate of $h_R(x)$ to $\langle r_j,x\rangle \bmod 2$. We construct a table with $2^s = n^{\Theta(\epsilon^{-2})}$ buckets,
indexed by $s$-bit strings.^1 Then, for each point $p \in P$, we insert $p$ into every bucket
$b \in \{0,1\}^s$ for which $\ell_H(h_R(p),b) \le (t + \frac{1}{2}h(\epsilon))s$, where $t$ is defined as in the previous
section (as the probability in (6.1) with $\Delta = L$). This preprocessing requires $n^{\Theta(\epsilon^{-2})}$ time.

With this data structure in hand, answering a query $q \in \{0,1\}^d$ is easy: just compute
the hash value $h_R(q)$ and return an arbitrary point of the corresponding bucket, or “none”
if this bucket is empty.
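The inverted index can be sketched as follows for toy parameters. This is an illustration under stated assumptions: the threshold is passed in as a plain integer rather than computed as $(t + \frac{1}{2}h(\epsilon))s$, the table is a Python dictionary keyed by $s$-bit tuples, and the names `build_table` and `query` are hypothetical. Enumerating all $2^s$ buckets is only feasible for very small $s$.

```python
import random
from itertools import product

def build_table(points, L, s, threshold, rng):
    """Insert every point p into each bucket b in {0,1}^s whose Hamming
    distance from h_R(p) is at most `threshold`."""
    d = len(points[0])
    p_one = 1.0 / (2 * L)
    R = [[1 if rng.random() < p_one else 0 for _ in range(d)] for _ in range(s)]

    def h_R(x):
        return tuple(sum(a & b for a, b in zip(r, x)) % 2 for r in R)

    table = {}
    for p in points:
        hp = h_R(p)
        for b in product((0, 1), repeat=s):
            if sum(u != v for u, v in zip(hp, b)) <= threshold:
                table.setdefault(b, []).append(p)
    return h_R, table

def query(h_R, table, q):
    """A single lookup: return some point in bucket h_R(q), or None if it is empty."""
    bucket = table.get(h_R(q))
    return bucket[0] if bucket else None
```

Note that all of the work happens at preprocessing time; a query costs one hash computation and one table access.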

For the analysis, think of an adversary who chooses a query point $q \in \{0,1\}^d$, and then
we subsequently flip our coins and build the above data structure. (This is the most common
way to analyze hashing, with the query independent of the random coin flips used to choose
the hash function.) Choosing the hidden constant in the definition of $s$ appropriately and
applying Corollary 6.1 with $\delta = \frac{1}{n^2}$, we find that, for every point $p \in P$, with probability at
least $1 - \frac{1}{n^2}$, $p$ is in the bucket $h_R(q)$ (if $\ell_H(p,q) \le L$) or is not in the bucket $h_R(q)$
(if $\ell_H(p,q) \ge (1+\epsilon)L$). Taking a Union Bound over the $n$ points of $P$, we find that the data
structure correctly answers the query $q$ with probability at least $1 - \frac{1}{n}$.

Before describing the full data structure, let’s take stock of what we’ve accomplished
thus far. We’ve shown that, for every constant $\epsilon > 0$, there is a data structure for the
decision version of the $(1+\epsilon)$-nearest neighbor problem that uses space $n^{O(\epsilon^{-2})}$, answers a
query with a single random access to the data structure, and for every query is correct with
high probability. Later in this lecture, we show a matching lower bound: every (possibly
randomized) data structure with equally good search performance for the decision version
of the $(1+\epsilon)$-nearest neighbor problem has space $n^{\Omega(\epsilon^{-2})}$. Thus, smaller space can only be
achieved by increasing the query time (and there are ways to do this, see e.g. Indyk (2004)).

^1 Note the frightening dependence of the space on $\frac{1}{\epsilon}$. This is why we suggested thinking of $\epsilon$ as a
not-too-small constant.
6.3.3 The Data Structure (Full Version)

The data structure of the previous section is an unsatisfactory solution to the $(1+\epsilon)$-nearest
neighbor problem in two respects:


1. In the real problem, there is no a priori known value of $L$. Intuitively, one would like
to take $L$ equal to the actual nearest-neighbor distance of a query point $q$, a quantity
that is different for different $q$’s.

2. Even for the decision version, the data structure can answer some queries incorrectly.
Since the data structure only guarantees correctness with probability at least $1 - \frac{1}{n}$
for each query $q$, it might be wrong on as many as $2^d/n$ different queries. Thus, an
adversary that knows the data structure’s coin flips can exhibit a query that the data
structure gets wrong.


The first fix is straightforward: just build $\approx d$ copies of the data structure of Section 6.3.2,
one for each relevant value of $L$.^2 Given a query $q$, the data structure now uses binary search
over $L$ to compute a $(1+\epsilon)$-nearest neighbor of $q$; see the Exercises for details. Answering
a query thus requires $O(\log d)$ lookups to the data structure. This also necessitates blowing
up the number $s$ of random strings used in each data structure by a $\Theta(\log\log d)$ factor:
this reduces the failure probability of a given lookup by a $\log d$ factor, enabling a Union
Bound over $\log d$ times as many lookups as before.^3

A draconian approach to the second problem is to again replicate the data structure
above $\Theta(d)$ times. Each query $q$ is asked in all $\Theta(d)$ copies, and majority vote is used to
determine the final answer. Since each copy is correct on each of the $2^d$ possible queries with
probability at least $1 - \frac{1}{n} > \frac{1}{2}$, the majority vote is wrong on a given query with probability
at most inverse exponential in $d$. Taking a Union Bound over the $2^d$ possible queries shows
that, with high probability over the coin flips used to construct the data structure, the data
structure answers every query correctly. Put differently, for almost all outcomes of the coin
flips, not even an adversary that knows the coin flip outcomes can produce a query on which
the data structure is incorrect. This solution blows up both the space used and the query
time by a factor of $\Theta(d)$.

An alternative approach is to keep $\Theta(d)$ copies of the data structure as above but, given
a query $q$, to answer the query using one of the $\Theta(d)$ copies chosen uniformly at random.
With this solution, the space gets blown up by a factor of $\Theta(d)$ but the query time is
unaffected. The correctness guarantee is now slightly weaker. With high probability over
the coin flips used to construct the data structure (in the preprocessing), the data structure
satisfies: for every query $q \in \{0,1\}^d$, with probability at least $1 - O(\frac{1}{n})$ over the coins
flipped at query time, the data structure answers the query correctly (why?). Equivalently,
think of an adversary who is privy to the outcomes of the coins used to construct the data
structure, but not those used to answer queries. For most outcomes of the coins used in
the preprocessing phase, no matter what query the adversary suggests, the data structure
answers the query correctly with high probability.

^2 For $L \ge d/(1+\epsilon)$, the data structure can just return an arbitrary point of $P$. For $L = 0$, when the
data structure of Section 6.3.2 is not well defined, a standard data structure for set membership, such as a
perfect hash table (Fredman et al. 1984), can be used.

^3 With some cleverness, this $\log\log d$ factor can be avoided; see the Exercises.


To put everything together, the data structure for a fixed $L$ (from Section 6.3.2) requires
$n^{\Theta(\epsilon^{-2})}$ space, the first fix blows up the space by a factor of $\Theta(d\log d)$, and the second
fix blows up the space by another factor of $d$. For the query time, with the alternative
implementation of the second fix, answering a query involves $O(\log d)$ lookups into the data
structure. Each lookup involves computing a hash value, which in turn involves computing
inner products (modulo 2) with $s = \Theta(\epsilon^{-2}\log n\log\log d)$ random strings. Each such inner
product requires $O(d)$ time.

Thus, the final scorecard for the data structure is:

• Space: $O(d^2\log d)\cdot n^{\Theta(\epsilon^{-2})}$.

• Query time: $O(\epsilon^{-2} d\log n\log\log d)$.

For example, for the suggested parameter values of $d = n^c$ for a constant $c \in (0,1)$ and $\epsilon$
a not-too-small constant, we obtain a query time significantly better than the brute-force
(exact) solution (which is $\Theta(dn)$), while using only a polynomial amount of space.


6.4 Lower Bounds via Asymmetric Communication Complexity 

We now turn our attention from upper bounds for the $(1+\epsilon)$-nearest neighbor problem
to lower bounds. We do this in three steps. In Section 6.4.1 we introduce a model for
proving lower bounds on time-space trade-offs in data structures: the cell probe model. In
Section 6.4.2 we explain how to deduce lower bounds in the cell probe model from a certain
type of communication complexity lower bound. Section 6.4.3 applies this machinery to
the $(1+\epsilon)$-approximate nearest neighbor problem, and proves a sense in which the data
structure of Section 6.3 is optimal: every data structure for the decision version of the
problem that uses $O(1)$ lookups per query has space $n^{\Omega(\epsilon^{-2})}$. Thus, no polynomial-sized
data structure can be both super-fast and super-accurate for this problem.


6.4.1 The Cell Probe Model 
Motivation 

The most widely used model for proving data structure lower bounds is the cell probe
model, introduced by Yao in the paper “Should Tables Be Sorted?” (Yao 1981), two years
after he developed the foundations of communication complexity (Yao 1979).^4 The point
of this model is to prove lower bounds for data structures that allow random access. To
make the model as powerful as possible, and hence lower bounds for it as strong as possible
(cf., our communication models), a random access to a data structure counts as 1 step in
this model, no matter how big the data structure is.

^4 Actually, what is now called the cell probe model is a bit stronger than the model proposed by
Yao (1981).


Some History 


To explain the title of the paper by Yao (1981), suppose your job is to store k elements from 
a totally ordered universe of n elements ({1, 2,..., n}, say) in an array of length k. The 
goal is to minimize the worst-case number of array accesses necessary to check whether or 
not a given element is in the array, over all subsets of k elements that might be stored and 
over the n possible queries. 

To see that this problem is non-trivial, suppose n = 3 and k = 2. One strategy is to store 
the pair of elements in sorted order, leading to the three possible arrays in Figure 6.1(a).
This yields a worst-case query time of 2 array accesses. To see this, suppose we want to 
check whether or not 2 is in the array. If we initially query the first array element and find 
a “1,” or if we initially query the second array element and find a “3,” then we can’t be sure 
whether or not 2 is in the array. 


Figure 6.1: Should tables be sorted? For $n = 3$ and $k = 2$, it is suboptimal to store the array
elements in sorted order. Panels: (a) Sorted Array; (b) Unsorted Array.


Suppose, on the other hand, our storage strategy is as shown in Figure 6.1(b). Whichever
array entry is queried first, the answer uniquely determines the other array entry. Thus,
storing the table in a non-sorted fashion is necessary to achieve the optimal worst-case query
time of 1. On the other hand, if $k = 2$ and $n = 4$, storing the elements in sorted order (and
using binary search to answer queries) is optimal.^5
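The $n = 3$, $k = 2$ example is small enough to check by brute force. The sketch below tests whether one array access always suffices for a given storage strategy. The figure itself is not reproduced here, so the “cyclic” layout is an assumption: it is one arrangement with the stated property that either array entry determines the other. The function and variable names are illustrative.

```python
def one_probe_suffices(layout, n):
    """layout maps each stored k-subset (a frozenset) to its array (a tuple).
    Returns True if, for every query x, there is a fixed array position whose
    content alone determines whether x is stored, across all subsets."""
    k = len(next(iter(layout.values())))
    for x in range(1, n + 1):
        found_position = False
        for i in range(k):
            answers = {}  # content seen at position i -> forced membership answer
            consistent = True
            for subset, array in layout.items():
                member = x in subset
                if answers.setdefault(array[i], member) != member:
                    consistent = False
                    break
            if consistent:
                found_position = True
                break
        if not found_position:
            return False
    return True

# Sorted storage, as in the discussion of Figure 6.1(a).
sorted_layout = {frozenset(a): a for a in [(1, 2), (1, 3), (2, 3)]}
# Assumed unsorted storage: a cyclic arrangement with distinct first entries.
cyclic_layout = {frozenset(a): a for a in [(1, 2), (2, 3), (3, 1)]}
```

For the sorted layout the check fails on the query 2, matching the argument in the text; for the cyclic layout the first entry already identifies the stored pair.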


Formal Model 


In the cell probe model, the goal is to encode a “database” $D$ in order to answer a set
$Q$ of queries. The query set $Q$ is known up front; the encoding scheme must work (in the
sense below) for all possible databases.

^5 Yao (1981) also uses Ramsey theory to prove that, provided the universe size is a sufficiently (really,
really) large function of the array size, then binary search on a sorted array is optimal. This result assumes
that no auxiliary storage is allowed, so solutions like perfect hashing (Fredman et al. 1984) are ineligible. If
the universe size is not too much larger than the array size, then there are better solutions (Fiat and Naor
1993; Gabizon and Hassidim 2010; Gabizon and Shaltiel 2012), even when there is no auxiliary storage.

For example, in the $(1+\epsilon)$-approximate nearest neighbor problem, the database corresponds
to the point set $P \subseteq \{0,1\}^d$, while the possible queries $Q$ correspond to the elements
of $\{0,1\}^d$. Another canonical example is the set membership problem: here, $D$ is a subset
of a universe $U$, and each query $q \in Q$ asks “is $i \in D$?” for some element $i \in U$.

A parameter of the cell probe model is the word size $w$; more on this shortly. Given
this parameter, the design space consists of the ways to encode databases $D$ as $s$ cells of $w$
bits each. We can view such an encoding as an abstract data structure representing the
database, and we view the number $s$ of cells as the space used by the encoding. To be a valid
encoding, it must be possible to correctly answer every query $q \in Q$ for the database $D$ by
reading enough bits of the encoding. A query-answering algorithm accesses the encoding by
specifying the name of a cell; in return, the algorithm is given the contents of that cell. Thus
every access to the encoding yields $w$ bits of information. The query time of an algorithm
(with respect to an encoding scheme) is the maximum, over databases $D$ (of a given size)
and queries $q \in Q$, of the number of accesses to the encoding used to answer a query.
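To make the definitions concrete, here is a minimal sketch of a cell-probe encoding for set membership: the database is stored as a bit vector split into cells of $w$ bits, and every access to a cell is counted. This particular encoding (with universe $\{0,\ldots,n-1\}$) and the class name are assumptions made for illustration; it uses $s = \lceil n/w \rceil$ cells and answers each query with one probe.

```python
class CellProbeMembership:
    """Bit-vector encoding of a subset D of {0, ..., n-1} in cells of w bits."""

    def __init__(self, D, n, w):
        self.w = w
        self.cells = [0] * ((n + w - 1) // w)  # s = ceil(n / w) cells
        for i in D:
            self.cells[i // w] |= 1 << (i % w)
        self.probes = 0  # total number of cell accesses so far

    def read_cell(self, name):
        """One access: returns the w-bit contents of the named cell."""
        self.probes += 1
        return self.cells[name]

    def contains(self, i):
        """Answer 'is i in D?' with a single probe."""
        word = self.read_cell(i // self.w)
        return ((word >> (i % self.w)) & 1) == 1
```

The query time here is 1 regardless of $n$, at the price of space that grows linearly with the universe size rather than with $|D|$.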


For example, in the original array example from Yao (1981) mentioned above, the word
size $w$ is $\lceil\log_2 n\rceil$, just enough to specify the name of an element. The goal in Yao (1981)
was, for databases consisting of $k$ elements of the universe, to understand when the
minimum-possible query time is $\lceil\log_2 k\rceil$ under the constraint that the space is $k$.^6

Most research on the cell-probe model seeks time-space trade-offs with respect to a fixed 
value for the word size w. Most commonly, the word size is taken large enough so that a 
single element of the database can be stored in a single cell, and ideally not too much larger 
than this. For nearest-neighbor-type problems involving n-point sets in the d-dimensional 
hypercube, this guideline suggests taking $w$ to be polynomial in $\max\{d, \log_2 n\}$.

For this choice of $w$, the data structure in Section 6.3.2 that solves the decision version of
the $(1+\epsilon)$-approximate nearest neighbor problem yields a (randomized) cell-probe encoding
of point sets with space $n^{\Theta(\epsilon^{-2})}$ and query time 1. Cells of this encoding correspond to all
possible $s = \Theta(\epsilon^{-2}\log n)$-bit hash values $h_R(q)$ of a query $q \in \{0,1\}^d$, and the contents of
a cell name an arbitrary point $p \in P$ with hash value $h_R(p)$ sufficiently close (in Hamming
distance in $\{0,1\}^s$) to that of the cell’s name (or “NULL” if no such $p$ exists). The rest of
this lecture proves a matching lower bound in the cell-probe model: constant query time
can only be achieved by encodings (and hence data structures) that use $n^{\Omega(\epsilon^{-2})}$ space.


From Data Structures to Communication Protocols

Our goal is to derive data structure lower bounds in the cell-probe model from communication
complexity lower bounds. Thus, we need to extract low-communication protocols from
good data structures. Similar to our approach last lecture, we begin with a contrived
communication problem to forge an initial connection. Later we’ll see how to prove lower
bounds for the contrived problem via reductions from other communication problems that
we already understand well.

^6 The model in Yao (1981) was a bit more restrictive: cells were required to contain names of elements
in the database, rather than arbitrary $\lceil\log_2 n\rceil$-bit strings.

Fix an instantiation of the cell probe model - i.e., a set of possible databases and possible 
queries. For simplicity, we assume that all queries are Boolean. In the corresponding 
Query-Database problem, Alice gets a query q and Bob gets a database D. (Note that in 
all natural examples, Bob’s input is much bigger than Alice’s.) The communication problem 
is to compute the answer to q on the database D. 

We made up the Query-Database problem so that the following lemma holds. 


Lemma 6.2 Consider a set of databases and queries so that there is a cell-probe encoding
with word size $w$, space $s$, and query time $t$. Then, there is a communication protocol for
the corresponding Query-Database problem with communication at most

$$\underbrace{t\log_2 s}_{\text{bits sent by Alice}} + \underbrace{t\,w}_{\text{bits sent by Bob}}.$$

The proof is the obvious simulation: Alice simulates the query-answering algorithm, sending
at most $\log_2 s$ bits to Bob to specify each cell requested by the algorithm, and Bob sends $w$
bits back to Alice to describe the contents of each requested cell. By assumption, they only
need to go back-and-forth at most $t$ times to identify the answer to Alice’s query $q$.
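The simulation can be written out directly. In the sketch below, Alice runs an arbitrary query-answering algorithm and every cell request is charged $\lceil\log_2 s\rceil$ bits to her and $w$ bits to Bob, as in the proof. The function names and the one-probe membership algorithm (which assumes a bit-vector encoding with 8-bit cells) are hypothetical examples, not part of the notes.

```python
def simulate_protocol(cells, w, query_algorithm, q):
    """Alice runs query_algorithm on her query q. Each cell request costs
    Alice ceil(log2 s) bits (the cell name) and Bob w bits (the cell contents)."""
    s = len(cells)
    name_bits = max(1, (s - 1).bit_length())  # ceil(log2 s), at least 1
    cost = {"alice": 0, "bob": 0}

    def probe(name):
        cost["alice"] += name_bits  # Alice names the cell
        cost["bob"] += w            # Bob replies with its contents
        return cells[name]

    answer = query_algorithm(q, probe)
    return answer, cost["alice"], cost["bob"]

def membership_query(i, probe, w=8):
    """Hypothetical one-probe algorithm for a bit-vector encoding with w-bit cells."""
    return ((probe(i // w) >> (i % w)) & 1) == 1
```

With $t$ probes the totals are $t\lceil\log_2 s\rceil$ and $tw$, which is the bound in Lemma 6.2 up to rounding.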


Lemma 6.2 reduces the task of proving data structure lower bounds to proving lower
bounds on the communication cost of protocols for the Query-Database problem.^7


6.4.2 Asymmetric Communication Complexity 

Almost all data structure lower bounds derived from communication complexity use
asymmetric communication complexity. This is just a variant of the standard two-party model
where we keep track of the communication by Alice and by Bob separately. The most
common motivation for doing this is when the two inputs have very different sizes, like in
the protocol used to prove Lemma 6.2 above.


Case Study: Index 

To get a feel for asymmetric communication complexity and lower bound techniques for it, 
let’s revisit an old friend, the INDEX problem. In addition to the application we saw earlier 
in the course, INDEX arises naturally as the Query-Database problem corresponding to 
the membership problem in data structures. 

^7 The two-party communication model seems strictly stronger than the data structure design problem that
it captures: in a communication protocol, Bob can remember which queries Alice asked about previously,
while a (static) data structure cannot. An interesting open research direction is to find communication
models and problems that more tightly capture data structure design problems, thereby implying strong
lower bounds.

Recall that an input of INDEX gives Alice an index $i \in \{1,2,\ldots,n\}$, specified using
$\approx \log_2 n$ bits, and Bob a subset $S \subseteq \{1,2,\ldots,n\}$, or equivalently an $n$-bit vector.^8 In
Lecture 2 we proved that the communication complexity of INDEX is $\Omega(n)$ for one-way
randomized protocols with two-sided error (Theorem 2.4): Bob must send almost his
entire input to Alice for Alice to have a good chance of computing her desired index. This
lower bound clearly does not apply to general communication protocols, since Alice can
just send her $\log_2 n$-bit input to Bob. It is also easy to prove a matching lower bound on
deterministic and nondeterministic protocols (e.g., by a fooling set argument).

We might expect a more refined lower bound to hold: to solve INDEX, not only do the
players have to send at least $\log_2 n$ bits total, but more specifically Alice has to send at least
$\log_2 n$ bits to Bob. Well, not quite: Bob could always send his entire input to Alice, using $n$
bits of communication while freeing Alice to use essentially no communication. Revising
our ambition, we could hope to prove that in every INDEX protocol, either (i) Alice has to
communicate most of her input; or (ii) Bob has to communicate most of his input. The
next result states that this is indeed the case.


Theorem 6.3 (Miltersen et al. 1998) For every $\delta > 0$, there exists a constant $N = N(\delta)$
such that, for every $n \ge N$ and every randomized communication protocol with two-sided
error that solves INDEX with $n$-bit inputs, either:


(i) in the worst case (over inputs and protocol randomness), Alice communicates at least
$\delta\log_2 n$ bits; or

(ii) in the worst case (over inputs and protocol randomness), Bob communicates at least
$n^{1-2\delta}$ bits.^9


Loosely speaking, Theorem 6.3 states that the only way Alice can get away with sending
$o(\log n)$ bits of communication is if Bob sends at least $n^{1-o(1)}$ bits of communication.

For simplicity, we’ll prove Theorem 6.3 only for deterministic protocols. The lower
bound for randomized protocols with two-sided error is very similar, just a little messier
(see Miltersen et al. (1998)).

Conceptually, the proof of Theorem 6.3 has the same flavor as many of our previous lower
bounds, and is based on covering-type arguments. The primary twist is that rather than
keeping track only of the size of monochromatic rectangles, we keep track of both the height
and width of such rectangles. For example, we’ve seen in the past that low-communication
protocols imply the existence of a large monochromatic rectangle: if the players haven’t
had the opportunity to speak much, then an outside observer hasn’t had the chance to
eliminate many inputs as legitimate possibilities. The next lemma proves an analog of
this, with the height and width of the monochromatic rectangle parameterized by the
communication used by Alice and Bob, respectively.

^8 We’ve reversed the roles of the players relative to the standard description we gave in earlier lectures.
This reverse version is the one corresponding to the Query-Database problem induced by the membership
problem.

^9 From the proof, it will be evident that $n^{1-2\delta}$ can be replaced by $n^{1-c\delta}$ for any constant $c > 1$.


Lemma 6.4 (Richness Lemma (Miltersen et al., 1998)) Let $f : X \times Y \to \{0,1\}$ be
a Boolean function with corresponding $X \times Y$ 0-1 matrix $M(f)$. Assume that:

(1) $M(f)$ has at least $v$ columns that each have at least $u$ 1-inputs.^10

(2) There is a deterministic protocol that computes $f$ in which Alice and Bob always send
at most $a$ and $b$ bits, respectively.^11

Then, $M(f)$ has a 1-rectangle $A \times B$ with $|A| \ge \frac{u}{2^a}$ and $|B| \ge \frac{v}{2^{a+b}}$.

The proof of Lemma 6.4 is a variation on the classic argument that a protocol computing
a function $f$ induces a partition of the matrix $M(f)$ into monochromatic rectangles. Let’s
recall the inductive argument. Let $z$ be a transcript-so-far of the protocol, and assume by
induction that the inputs $(x,y)$ that lead to $z$ form a rectangle $A \times B$. Assume that Alice
speaks next (the other case is symmetric). Partition $A$ into $A_0, A_1$, with $A_\eta$ the inputs
$x \in A$ such that Alice sends the bit $\eta$ next. (As always, this bit depends only on her input $x$
and the transcript-so-far $z$.) After Alice speaks, the inputs consistent with the resulting
transcript are either $A_0 \times B$ or $A_1 \times B$; either way, a rectangle. All inputs that generate
the same final transcript $z$ form a monochromatic rectangle: since the protocol’s output
is constant across these inputs and it computes the function $f$, $f$ is also constant across
these inputs.

Now let’s refine this argument to keep track of the dimensions of the monochromatic
rectangle, as a function of the number of times that each of Alice and Bob speak.

Proof of Lemma 6.4: We proceed by induction on the number of steps of the protocol.
Suppose the protocol has generated the transcript-so-far $z$ and that $A \times B$ is the rectangle
of inputs consistent with this transcript. Suppose that at least $c$ of the columns of $B$ have
at least $d$ 1-inputs in rows of $A$ (possibly with different rows for different columns).

For the first case, suppose that Bob speaks next. Partition $A \times B$ into $A \times B_0$ and
$A \times B_1$, where $B_\eta$ are the inputs $y \in B$ such that (given the transcript-so-far $z$) Bob sends
the bit $\eta$. At least one of the sets $B_0, B_1$ contains at least $c/2$ columns that each contain at
least $d$ 1-inputs in the rows of $A$ (Figure 6.2(a)).

For the second case, suppose that Alice speaks. Partition $A \times B$ into $A_0 \times B$ and $A_1 \times B$.
It is not possible that both (i) $A_0 \times B$ has strictly fewer than $c/2$ columns with $d/2$ or more
1-inputs in the rows of $A_0$ and (ii) $A_1 \times B$ has strictly fewer than $c/2$ columns with $d/2$ or
more 1-inputs in the rows of $A_1$. For if both (i) and (ii) held, then $A \times B$ would have fewer
than $c$ columns with $d$ or more 1-inputs in the rows of $A$, a contradiction (Figure 6.2(b)).

^10 Such a matrix is sometimes called $(u,v)$-rich.

^11 This is sometimes called an $[a,b]$-protocol.

Figure 6.2: Proof of the Richness Lemma (Lemma 6.4). When Bob speaks, at least one of the
corresponding subrectangles has at least $c/2$ columns that each contain at least $d$ 1-inputs. When
Alice speaks, at least one of the corresponding subrectangles has at least $c/2$ columns that each
contain at least $d/2$ 1-inputs. Panels: (a) When Bob Speaks; (b) When Alice Speaks.


By induction, we conclude that there is a 1-input $(x,y)$ such that, at each point of
the protocol’s execution on $(x,y)$ (with Alice and Bob having sent $\alpha$ and $\beta$ bits so far,
respectively), the current rectangle $A \times B$ of inputs consistent with the protocol’s execution
has at least $v/2^{\alpha+\beta}$ columns (in $B$) that each contain at least $u/2^{\alpha}$ 1-inputs (among the
rows of $A$). Since the protocol terminates with a monochromatic rectangle of $M(f)$ and
with $\alpha \le a$, $\beta \le b$, the proof is complete. ■
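The two counting claims used in the inductive step hold for any way of splitting the rows or the columns, which makes them easy to test numerically. The sketch below checks both cases on an arbitrary 0-1 matrix with random splits; the helper names are illustrative only.

```python
import random

def rich_columns(M, rows, cols, d):
    """Number of columns in `cols` with at least d ones among `rows`."""
    return sum(1 for c in cols if sum(M[r][c] for r in rows) >= d)

def check_one_step(M, rows, cols, d, rng):
    """Split the columns (Bob speaks) and the rows (Alice speaks) at random
    and confirm the counting claims from the proof for the current rectangle."""
    c = rich_columns(M, rows, cols, d)
    # Bob speaks: the columns are partitioned, the threshold d is unchanged.
    B0 = {x for x in cols if rng.random() < 0.5}
    B1 = [x for x in cols if x not in B0]
    bob_ok = 2 * max(rich_columns(M, rows, B0, d),
                     rich_columns(M, rows, B1, d)) >= c
    # Alice speaks: the rows are partitioned, the threshold drops to d / 2.
    A0 = {x for x in rows if rng.random() < 0.5}
    A1 = [x for x in rows if x not in A0]
    alice_ok = 2 * max(rich_columns(M, A0, cols, d / 2),
                       rich_columns(M, A1, cols, d / 2)) >= c
    return bob_ok and alice_ok
```

The asymmetry is visible in the thresholds: a bit from Bob only halves the number of good columns, while a bit from Alice may halve both the number of good columns and the number of 1-inputs per column.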



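Proof of Theorem 6.3 (deterministic case): We first observe that the matrix $M(\text{INDEX})$ has
at least $\binom{n}{n/2}$ columns that each contain at least $n/2$ 1-inputs. To see this, restrict attention
to the columns that correspond to subsets $S \subseteq \{1,2,\ldots,n\}$ of size $n/2$ and, for such a
column $S$, consider the rows (i.e., indices for Alice) that correspond to the elements of $S$.

Now suppose for contradiction that there is a protocol that solves INDEX in which
Alice always sends at most $a = \delta\log_2 n$ bits and Bob always sends at most $b = n^{1-2\delta}$ bits.
Invoking Lemma 6.4 proves that the matrix $M(\text{INDEX})$ has a 1-rectangle of size at least

$$\frac{n}{2}\cdot\frac{1}{2^a} \times \binom{n}{n/2}\cdot\frac{1}{2^{a+b}} \ge \frac{1}{2}n^{1-\delta} \times c_2 2^{n - n^{1-1.5\delta}},$$

where $c_2 > 0$ is a constant independent of $n$. (We’re using here that $\binom{n}{n/2} \approx 2^n/\sqrt{n}$ and that
$n \ge N(\delta)$ is sufficiently large.)

On the other hand, how many columns can there be in a 1-rectangle with $\frac{1}{2}n^{1-\delta}$ rows?
If these rows correspond to the set $S \subseteq \{1,2,\ldots,n\}$ of indices, then every column of the
1-rectangle must correspond to a superset of $S$. There are

$$2^{n-|S|} = 2^{n - \frac{1}{2}n^{1-\delta}}$$

of these. But

$$c_2 2^{n - n^{1-1.5\delta}} > 2^{n - \frac{1}{2}n^{1-\delta}}$$

for sufficiently large $n$, providing the desired contradiction. ■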

Does the asymmetric communication complexity lower bound in Theorem 6.3 have any
interesting implications? By Lemma 6.2, a data structure that supports membership queries
with query time $t$, space $s$, and word size $w$ induces a communication protocol for INDEX in
which Alice sends at most $t\log_2 s$ bits and Bob sends at most $tw$ bits. For example, suppose
$t = O(1)$ and $w$ at most poly-logarithmic in $n$. Since Bob only sends $tw$ bits in the induced
protocol for INDEX, he certainly does not send $n^{1-2\delta}$ bits.^12 Thus, Theorem 6.3 implies
that Alice must send at least $\delta\log_2 n$ bits in the protocol. This implies that

$$t\log_2 s \ge \delta\log_2 n$$

and hence $s \ge n^{\delta/t}$. The good news is that this is a polynomial lower bound for every
constant $t$. The bad news is that even for $t = 1$, this argument will never prove a super-linear
lower bound. We don’t expect to prove a super-linear lower bound in the particular case
of the membership problem, since there is a data structure for this problem with constant
query time and linear space (e.g., perfect hashing (Fredman et al. 1984)). For the
$(1+\epsilon)$-approximate nearest neighbor problem, on the other hand, we want to prove a lower bound
of the form $n^{\Omega(\epsilon^{-2})}$. To obtain such a super-linear lower bound, we need to reduce from a
communication problem harder than INDEX, or rather, a communication problem in which
Alice’s input is bigger than $\log_2 n$ and in which she still reveals almost her entire input in
every communication protocol induced by a constant-query data structure.


$(k,\ell)$-Disjointness

A natural idea for modifying INDEX so that Alice’s input is bigger is to give Alice multiple
indices; Bob’s input remains an $n$-bit vector. The new question is whether or not for at least
one of Alice’s indices, Bob’s input is a 1.^13 This problem is essentially equivalent (up to
the details of how Alice’s input is encoded) to DISJOINTNESS.

^12 Here $\delta \in (0,1)$ is a constant and $n \ge N(\delta)$ is sufficiently large. Using the version of Theorem 6.3 with
$n^{1-2\delta}$ replaced by $n^{1-c\delta}$ for an arbitrary constant $c > 1$, we can take $\delta$ arbitrarily close to 1.

This section considers the special case of DISJOINTNESS where the sizes of the sets
given to Alice and Bob are restricted. If we follow the line of argument in the proof of
Theorem 6.3, the best-case scenario is a space lower bound of $2^{\Omega(a)}$, where $a$ is the length
of Alice’s input; see also the proofs of Corollary 6.8 and Theorem 6.9 at the end of the
lecture. This is why the INDEX problem (where Alice’s set is a singleton and $a = \log_2 n$)
cannot lead, at least via Lemma 6.2, to super-linear data structure lower bounds.
The minimum $a$ necessary for the desired space lower bound of $n^{\Omega(\epsilon^{-2})}$ is $\epsilon^{-2}\log_2 n$. This
motivates considering instances of DISJOINTNESS in which Alice receives a set of size $\epsilon^{-2}$.
Formally, we define $(k,\ell)$-DISJOINTNESS as the communication problem in which Alice’s
input is a set of size $k$ (from a universe $U$) and Bob’s input is a set of size $\ell$ (also from
$U$), and the goal is to determine whether the sets are disjoint (a 1-input) or not (a
0-input).

We next extend the proof of Theorem 6.3 to show the following. 


Theorem 6.5 (Andoni et al. 2006; Miltersen et al. 1998) For every $\epsilon, \delta > 0$ and every
sufficiently large $n \ge N(\epsilon,\delta)$, in every communication protocol that solves
$(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS with a universe of size $2n$, either:


(i) Alice sends at least $\frac{\delta}{\epsilon^2}\log_2 n$ bits; or

(ii) Bob sends at least $n^{1-2\delta}$ bits.


As with Theorem 6.3, we’ll prove Theorem 6.5 for the special case of deterministic protocols.
The theorem also holds for randomized protocols with two-sided error (Andoni et al. 2006),
and we’ll use this stronger result in Theorem 6.9 below. (Our upper bound in Section 6.3.2
is randomized, so we really want a randomized lower bound.) The proof for randomized
protocols argues along the lines of the $\Omega(\sqrt{n})$ lower bound for the standard version of
DISJOINTNESS, and is not as hard as the stronger $\Omega(n)$ lower bound (recall the discussion
in Section 4.3.4).


Proof of Theorem 6.5: Let $M$ denote the 0-1 matrix corresponding to the $(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS
function. Ranging over all subsets of $U$ of size $n$, and for a given such set $S$, over all subsets
of $U \setminus S$ of size $\frac{1}{\epsilon^2}$, we see that $M$ has at least $\binom{2n}{n}$ columns that each have at least $\binom{n}{\epsilon^{-2}}$
1-inputs.

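Assume for contradiction that there is a communication protocol for $(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS
such that neither (i) nor (ii) holds. By the Richness Lemma (Lemma 6.4), there exists a
1-rectangle $A \times B$ where

$$|A| \ge \binom{n}{\epsilon^{-2}} \cdot 2^{-\frac{\delta}{\epsilon^2}\log_2 n} \ge (\epsilon^2 n)^{\frac{1}{\epsilon^2}} \cdot n^{-\frac{\delta}{\epsilon^2}} = \epsilon^{\frac{2}{\epsilon^2}}\, n^{\frac{1}{\epsilon^2}(1-\delta)} \tag{6.2}$$

and

$$|B| \ge \binom{2n}{n} \cdot 2^{-\frac{\delta}{\epsilon^2}\log_2 n} \cdot 2^{-n^{1-2\delta}} \ge 2^{2n - n^{1-3\delta/2}}, \tag{6.3}$$

where in (6.3) we are using that $n$ is sufficiently large.

Since $A \times B$ is a 1-rectangle, $S$ and $T$ are disjoint for every choice of $S \in A$ and $T \in B$.
This implies that $\cup_{S \in A} S$ and $\cup_{T \in B} T$ are disjoint sets. Letting

$$s = \left|\cup_{S \in A} S\right|,$$

we have

$$|A| \le \binom{s}{\epsilon^{-2}} \le s^{\frac{1}{\epsilon^2}}. \tag{6.4}$$

Combining (6.2) and (6.4) implies that

$$s \ge \epsilon^2 n^{1-\delta}.$$

Since every subset $T \in B$ avoids the $s$ elements in $\cup_{S \in A} S$,

$$|B| \le 2^{2n-s} \le 2^{2n - \epsilon^2 n^{1-\delta}}. \tag{6.5}$$

Inequalities (6.3) and (6.5) furnish the desired contradiction. ■

^13 This is reminiscent of a “direct sum,” where Alice and Bob are given multiple instances of a
communication problem and have to solve all of them. Direct sums are a fundamental part of communication
complexity, but we won’t have time to discuss them.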


The upshot is that, for the goal of proving a communication lower bound of $\Omega(\epsilon^{-2}\log n)$
(for Alice, in a Query-Database problem) and a consequent data structure space lower
bound of $n^{\Omega(\epsilon^{-2})}$, $(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS is a promising candidate to reduce from.

6.4.3 Lower Bound for the $(1+\epsilon)$-Approximate Nearest Neighbor Problem

The final major step is to show that $(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS, which is hard by Theorem 6.5,
reduces to the Query-Database problem for the decision version of the $(1+\epsilon)$-nearest
neighbor problem.


A Simpler Lower Bound of $n^{\Omega(\epsilon^{-1})}$

We begin with a simpler reduction that leads to a suboptimal but still interesting space
lower bound of $n^{\Omega(\epsilon^{-1})}$. In this reduction, we’ll reduce from $(\frac{1}{\epsilon}, n)$-DISJOINTNESS rather
than $(\frac{1}{\epsilon^2}, n)$-DISJOINTNESS. Alice is given a $\frac{1}{\epsilon}$-set $S$ (from a universe $U$ of size $2n$), which
we need to map to a nearest-neighbor query. Bob is given an $n$-set $T \subseteq U$, which we need
to map to a point set.

Our first idea is to map the input to $(1/\varepsilon, n)$-Disjointness to a nearest-neighbor query
in the $2n$-dimensional hypercube $\{0,1\}^{2n}$. Alice performs the obvious mapping of her input,
from the set $S$ to a query point $q$ that is the characteristic vector of $S$ (which lies in $\{0,1\}^{2n}$).
Bob maps his input $T$ to the point set $P = \{e_i : i \in T\}$, where $e_i$ denotes the characteristic
vector of the singleton $\{i\}$ (i.e., the $i$th standard basis vector).

If the sets $S$ and $T$ are disjoint, then the corresponding query $q$ has Hamming distance
$\frac{1}{\varepsilon} + 1$ from every point in the corresponding point set $P$. If $S$ and $T$ are not disjoint, then
there exists a point $e_i \in P$ such that the Hamming distance between $q$ and $e_i$ is $\frac{1}{\varepsilon} - 1$.
Thus, the $(1/\varepsilon, n)$-Disjointness problem reduces to the $(1+\varepsilon)$-approximate nearest neighbor
problem in the $2n$-dimensional Hamming cube, where $2n$ is also the size of the universe
from which the point set is drawn.
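The two distance claims above are easy to check on a toy instance. The following is a minimal sketch, assuming a universe identified with {0, ..., 2n-1}; the function names are illustrative and do not come from the notes.

```python
from typing import List, Set


def characteristic_vector(s: Set[int], universe_size: int) -> List[int]:
    """0/1 vector with a 1 exactly in the coordinates listed in s."""
    return [1 if i in s else 0 for i in range(universe_size)]


def hamming(x: List[int], y: List[int]) -> int:
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def nearest_neighbor_distance(alice_set: Set[int], bob_set: Set[int],
                              universe_size: int) -> int:
    """Map (S, T) to (q, P) as in the text; return the minimum Hamming
    distance between q and a point of P."""
    q = characteristic_vector(alice_set, universe_size)
    points = [characteristic_vector({i}, universe_size) for i in sorted(bob_set)]
    return min(hamming(q, p) for p in points)


# Toy instance (an assumption for illustration): eps = 1/4, so |S| = 4;
# n = 6, so the universe has 2n = 12 elements.
S = {0, 1, 2, 3}
T_disjoint = {4, 5, 6, 7, 8, 9}
T_intersecting = {3, 4, 5, 6, 7, 8}
print(nearest_neighbor_distance(S, T_disjoint, 12))      # 1/eps + 1
print(nearest_neighbor_distance(S, T_intersecting, 12))  # 1/eps - 1
```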

We’re not done, because extremely high-dimensional nearest neighbor problems are not
very interesting. The convention in nearest neighbor problems is to assume that the word
size $w$ (recall Section 6.4.1) is at least the dimension. When $d$ is at least the size of the
universe from which points are drawn, an entire point set can be described using a single
word! This means that our reduction so far cannot possibly yield an interesting lower bound
in the cell probe model. We fix this issue by applying dimension reduction, just as in our
upper bound in Section 6.3.2, to the instances produced by the above reduction.

Precisely, we can use the following embedding lemma.


Lemma 6.6 (Embedding Lemma #1) There exists a randomized function $f$ from $\{0,1\}^{2n}$
to $\{0,1\}^d$ with $d = \Theta(\varepsilon^{-2} \log n)$ and a constant $\alpha > 0$ such that, for every set $P \subseteq \{0,1\}^{2n}$
of $n$ points and query $q \in \{0,1\}^{2n}$ produced by the reduction above, with probability at least
$1 - \frac{1}{n}$:

(1) if the nearest-neighbor distance between $q$ and $P$ is $\frac{1}{\varepsilon} - 1$, then the nearest-neighbor
distance between $f(q)$ and $f(P)$ is at most $\alpha$;

(2) if the nearest-neighbor distance between $q$ and $P$ is $\frac{1}{\varepsilon} + 1$, then the nearest-neighbor
distance between $f(q)$ and $f(P)$ is at least $\alpha(1 + h(\varepsilon))$, where $h(\varepsilon) > 0$ is a constant
depending on $\varepsilon$ only.


Lemma 6.6 is an almost immediate consequence of Corollary 6.1: the map $f$ just takes
$d = \Theta(\varepsilon^{-2} \log n)$ random inner products with $2n$-bit vectors, where the probability of a “1”
is roughly $\varepsilon/2$. We used this idea in Section 6.3.2 for a data structure; here we’re using it
for a lower bound!
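To make the map concrete, here is a minimal sketch of an embedding by random inner products modulo 2. It is an illustration under assumptions, not the exact construction behind Corollary 6.1: the output dimension 64 and the bias eps/2 are placeholder choices, and the helper names are hypothetical. The function `flip_probability` uses the standard fact that the parity of a given number of independent p-biased bits is 1 with probability (1 - (1 - 2p)^distance)/2, which is why inputs that are farther apart in Hamming distance are separated by more output bits in expectation.

```python
import random
from typing import List


def sample_projection(dim_in: int, dim_out: int, p: float,
                      rng: random.Random) -> List[List[int]]:
    """dim_out random 0/1 vectors of length dim_in; each entry is 1 with probability p."""
    return [[1 if p > rng.random() else 0 for _ in range(dim_in)]
            for _ in range(dim_out)]


def embed(x: List[int], rows: List[List[int]]) -> List[int]:
    """f(x): one inner product (mod 2) of x with each random vector."""
    return [sum(r * xi for r, xi in zip(row, x)) % 2 for row in rows]


def flip_probability(distance: int, p: float) -> float:
    """Probability that a single output bit differs for two inputs at the
    given Hamming distance (parity of `distance` independent p-biased bits)."""
    return (1.0 - (1.0 - 2.0 * p) ** distance) / 2.0


eps = 0.25
rng = random.Random(0)  # stands in for Alice and Bob's shared randomness
rows = sample_projection(dim_in=12, dim_out=64, p=eps / 2, rng=rng)

near = flip_probability(int(1 / eps) - 1, eps / 2)  # intersecting case
far = flip_probability(int(1 / eps) + 1, eps / 2)   # disjoint case
print(near, far)
```

The expected Hamming distance after embedding is the output dimension times these per-bit probabilities, so the gap between `near` and `far` is what the concentration argument has to preserve.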

Composing our initial reduction with Lemma 6.6 yields the following.


Corollary 6.7 Every randomized asymmetric communication lower bound for $(1/\varepsilon, n)$-Disjointness
carries over to the Query-Database problem for the $(1+\varepsilon)$-approximate nearest neighbor
problem in $d = \Omega(\varepsilon^{-2} \log n)$ dimensions.

Proof: To recap our ideas, the reduction works as follows. Given inputs to $(1/\varepsilon, n)$-Disjointness,
Alice interprets her input as a query and Bob interprets his input as a
point set (both in $\{0,1\}^{2n}$) as described at the beginning of the section. They use shared
randomness to choose the function $f$ of Lemma 6.6 and use it to map their inputs to $\{0,1\}^d$
with $d = \Theta(\varepsilon^{-2} \log n)$. They run the assumed protocol for the Query-Database problem
for the $(1+\varepsilon)$-approximate nearest neighbor problem in $d$ dimensions. Provided the hidden
constant in the definition of $d$ is sufficiently large, correctness (with high probability) is
guaranteed by Lemma 6.6. (Think of Lemma 6.6 as being invoked with a parameter $\varepsilon'$
satisfying $h(\varepsilon') = \varepsilon$.) The amount of communication used by the $(1/\varepsilon, n)$-Disjointness
protocol is identical to that of the Query-Database protocol. ■


Following the arguments of Section 6.4.2 translates our asymmetric communication
complexity lower bound (via Lemma 6.2) to a data structure space lower bound.

Corollary 6.8 Every data structure for the decision version of the $(1+\varepsilon)$-approximate
nearest neighbors problem with query time $t = O(1)$ and word size $w = O(n^{1-\delta})$ for constant
$\delta > 0$ uses space $s = n^{\Omega(\varepsilon^{-1})}$.

Proof: Since $tw = O(n^{1-\delta})$, in the induced communication protocol for the Query-Database
problem (and hence $(1/\varepsilon, n)$-Disjointness, via Corollary 6.7), Bob sends a
sublinear number of bits. Theorem 6.5 then implies that Alice sends at least $\Omega(\varepsilon^{-1} \log n)$
bits, and so (by Lemma 6.2) we have $t \log_2 s = \Omega(\varepsilon^{-1} \log n)$. Since $t = O(1)$, this implies
that $s = n^{\Omega(\varepsilon^{-1})}$. ■
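The last step of the proof can be spelled out as a short calculation. Here $c > 0$ is a hypothetical name for the constant hidden in the $\Omega(\cdot)$ notation; it is introduced only for illustration.

```latex
\[
  t \log_2 s \;\ge\; \frac{c}{\varepsilon} \log_2 n
  \quad\Longrightarrow\quad
  \log_2 s \;\ge\; \frac{c}{t\,\varepsilon} \log_2 n
  \quad\Longrightarrow\quad
  s \;\ge\; n^{c/(t\varepsilon)} \;=\; n^{\Omega(1/\varepsilon)}
  \quad \text{when } t = O(1).
\]
```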


The $n^{\Omega(\varepsilon^{-2})}$ Lower Bound

The culmination of this lecture is the following.

Theorem 6.9 Every data structure for the decision version of the $(1+\varepsilon)$-approximate
nearest neighbors problem with query time $t = O(1)$ and word size $w = O(n^{1-\delta})$ for constant
$\delta > 0$ uses space $s = n^{\Omega(\varepsilon^{-2})}$.

The proof is a refinement of the embedding arguments we used to prove Corollary 6.8.
In that proof, the reduction structure was

\[ S, T \subseteq U \;\mapsto\; \{0,1\}^{2n} \;\mapsto\; \{0,1\}^{d}, \]

with inputs $S, T$ of $(1/\varepsilon, n)$-Disjointness mapped to the $2n$-dimensional Hamming cube and
then to the $d$-dimensional Hamming cube, with $d = \Theta(\varepsilon^{-2} \log n)$.

The new plan is

\[ S, T \subseteq U \;\mapsto\; (\mathbb{R}^{2n}, \ell_2) \;\mapsto\; (\mathbb{R}^{D}, \ell_1) \;\mapsto\; \{0,1\}^{D'} \;\mapsto\; \{0,1\}^{d}, \]

[Footnote] The $(1/\varepsilon, n)$-Disjointness protocol is randomized with two-sided error even if the Query-Database
protocol is deterministic. This highlights our need for Theorem 6.5 in its full generality.


where $d = \Theta(\varepsilon^{-2} \log n)$ as before, and $D, D'$ can be very large. Thus we map inputs $S, T$
of $(1/\varepsilon^2, n)$-Disjointness to $2n$-dimensional Euclidean space (with the $\ell_2$ norm), which we
then map (preserving distances) to high-dimensional space with the $\ell_1$ norm, then to the
high-dimensional Hamming cube, and finally to the $\Theta(\varepsilon^{-2} \log n)$-dimensional Hamming cube
as before (via Lemma 6.6). The key insight is that switching the initial embedding from
the high-dimensional hypercube to high-dimensional Euclidean space achieves a nearest
neighbor gap of $1 \pm \varepsilon$ even when Alice begins with a $\frac{1}{\varepsilon^2}$-set; the rest of the argument uses
standard (if non-trivial) techniques to eventually get back to a hypercube of reasonable
dimension.

To add detail to the important first step, consider inputs $S, T$ to $(1/\varepsilon^2, n)$-Disjointness.
Alice maps her set $S$ to a query vector $q$ that is $\varepsilon$ times the characteristic vector of $S$, which
we interpret as a point in $2n$-dimensional Euclidean space. Bob maps his input $T$ to the
point set $P = \{e_i : i \in T\}$, again in $2n$-dimensional Euclidean space, where $e_i$ denotes the
$i$th standard basis vector.

First, suppose that $S$ and $T$ are disjoint. Then, the $\ell_2$ distance between Alice’s query $q$
and each point $e_i \in P$ is

\[ \sqrt{1 + \tfrac{1}{\varepsilon^2} \cdot \varepsilon^2} = \sqrt{2}. \]

If $S$ and $T$ are not disjoint, then there exists a point $e_i \in P$ such that the $\ell_2$ distance
between $q$ and $e_i$ is:

\[ \sqrt{(1-\varepsilon)^2 + \left(\tfrac{1}{\varepsilon^2} - 1\right)\varepsilon^2} = \sqrt{2 - 2\varepsilon} \le \sqrt{2}\left(1 - \tfrac{\varepsilon}{2}\right). \]

Thus, as promised, switching to the $\ell_2$ norm, and tweaking Alice’s query, allows us
to get a $1 \pm \Theta(\varepsilon)$ gap in nearest-neighbor distance between the “yes” and “no” instances
of $(1/\varepsilon^2, n)$-Disjointness. This immediately yields (via Theorem 6.5, following the proof
of Corollary 6.8) lower bounds for the $(1+\varepsilon)$-approximate nearest neighbor problem in
high-dimensional Euclidean space in the cell-probe model. We can extend these lower bounds
to the hypercube through the following embedding lemma.
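The two Euclidean distances computed above can be confirmed numerically. This is a small sketch on an assumed toy instance (eps = 1/2, so Alice holds 4 elements, in a universe of size 16); the helper names are illustrative.

```python
import math
from typing import List, Set


def scaled_query(alice_set: Set[int], universe_size: int, eps: float) -> List[float]:
    """Alice's query: eps times the characteristic vector of her set."""
    return [eps if i in alice_set else 0.0 for i in range(universe_size)]


def basis_vector(i: int, universe_size: int) -> List[float]:
    """The i-th standard basis vector, one of Bob's points."""
    return [1.0 if j == i else 0.0 for j in range(universe_size)]


def l2(x: List[float], y: List[float]) -> float:
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


eps = 0.5                       # so |S| = 1/eps^2 = 4
S = {0, 1, 2, 3}
q = scaled_query(S, 16, eps)

d_outside = l2(q, basis_vector(9, 16))  # element of T that is not in S
d_inside = l2(q, basis_vector(2, 16))   # element of T that is also in S
print(d_outside, math.sqrt(2))
print(d_inside, math.sqrt(2 - 2 * eps), math.sqrt(2) * (1 - eps / 2))
```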


Lemma 6.10 (Embedding Lemma #2) For every $\delta > 0$ there exists a randomized
function $f$ from $\mathbb{R}^{2n}$ to $\{0,1\}^D$ (with possibly large $D = D(\delta)$) such that, for every set
$P \subseteq \{0,1\}^{2n}$ of $n$ points and query $q \in \{0,\varepsilon\}^{2n}$ produced by the reduction above, with
probability at least $1 - \frac{1}{n}$,

\[ \ell_H(f(p), f(q)) \in (1 \pm \delta) \cdot \ell_2(p, q) \]

for every $p \in P$.


Thus Lemma 6.10 says that one can re-represent a query $q$ and a set $P$ of $n$ points in
$\mathbb{R}^{2n}$ in a high-dimensional hypercube so that the nearest-neighbor distance ($\ell_2$ distance
in the domain, Hamming distance in the range) is approximately preserved, with the
approximation factor tending to 1 as the number $D$ of dimensions tends to infinity. The
Exercises outline the proof, which combines two standard facts from the theory of metric
embeddings:

1. “$L_2$ embeds isometrically into $L_1$.” For every $\delta > 0$ and dimension $D$ there exists a
randomized function $f$ from $\mathbb{R}^D$ to $\mathbb{R}^{D'}$, where $D'$ can depend on $D$ and $\delta$, such that,
for every set $P \subseteq \mathbb{R}^D$ of $n$ points, with probability at least $1 - \frac{1}{n}$,

\[ \| f(p) - f(p') \|_1 \in (1 \pm \delta) \cdot \| p - p' \|_2 \]

for all $p, p' \in P$.

2. “$L_1$ embeds isometrically into the (scaled) Hamming cube.” For every $\delta > 0$, there
exist constants $M = M(\delta)$ and $D'' = D''(D', \delta)$ and a function $g : \mathbb{R}^{D'} \to \{0,1\}^{D''}$
such that, for every set $P \subseteq \mathbb{R}^{D'}$,

\[ d_H(g(p), g(p')) = M \cdot \| p - p' \|_1 \pm \delta \]

for every $p, p' \in P$.
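The second fact has a well-known special case that is easy to see in code: for points with nonnegative integer coordinates bounded by some value, unary encoding maps the $\ell_1$ distance exactly onto the Hamming distance. The sketch below shows only this special case, and is not necessarily the construction the Exercises have in mind. The bound `max_value` and the function names are assumptions for illustration; handling real-valued points would additionally require scaling and rounding, which is plausibly where the scaling constant and additive error in the statement come from.

```python
from typing import List


def unary_embed(point: List[int], max_value: int) -> List[int]:
    """Encode each coordinate c as c ones followed by (max_value - c) zeros."""
    if min(point) < 0 or max(point) > max_value:
        raise ValueError("coordinates must lie in [0, max_value]")
    bits: List[int] = []
    for c in point:
        bits.extend([1] * c + [0] * (max_value - c))
    return bits


def l1(x: List[int], y: List[int]) -> int:
    """The l1 distance between two integer vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x: List[int], y: List[int]) -> int:
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


p = [3, 0, 5, 2]
p_prime = [1, 4, 5, 0]
print(l1(p, p_prime), hamming(unary_embed(p, 5), unary_embed(p_prime, 5)))
```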


With Lemma 6.10 in hand, we can prove Theorem 6.9 by following the argument in
Corollary 6.8.

Proof of Theorem 6.9: By Lemmas 6.6 and 6.10, Alice and Bob can use a communication
protocol that solves the Query-Database problem for the decision version of $(1+\varepsilon)$-nearest
neighbors to solve the $(1/\varepsilon^2, n)$-Disjointness problem, with no additional communication
(only shared randomness, to pick the random functions in Lemmas 6.6 and 6.10). Thus,
the (randomized) asymmetric communication lower bound for the latter problem applies
also to the former problem.

Since $tw = O(n^{1-\delta})$, in the induced communication protocol for the Query-Database
problem (and hence $(1/\varepsilon^2, n)$-Disjointness), Bob sends a sublinear number of bits. Theorem 6.5
then implies that Alice sends at least $\Omega(\varepsilon^{-2} \log_2 n)$ bits, and so (by Lemma 6.2) we
have $t \log_2 s = \Omega(\varepsilon^{-2} \log_2 n)$. Since $t = O(1)$, this implies that $s = n^{\Omega(\varepsilon^{-2})}$. ■

[Footnote] Strictly speaking, we’re using a generalization of Lemma 6.6 (with the same proof) where the query
and point set can lie in a hypercube of arbitrarily large dimension, not just $2n$.
















Lecture 7 


Lower Bounds in Algorithmic Game Theory 


7.1 Preamble 

This lecture explains some applications of communication complexity to proving lower 
bounds in algorithmic game theory (AGT), at the border of computer science and economics. 
In AGT, the natural description size of an object is often exponential in a parameter of 
interest, and the goal is to perform non-trivial computations in time polynomial in the 
parameter (i.e., logarithmic in the description size). As we know, communication complexity 
is a great tool for understanding when non-trivial computations require looking at most of 
the input. 


7.2 The Welfare Maximization Problem 

The focus of this lecture is the following optimization problem, which has been studied in 
AGT more than any other. 

1. There are k players. 

2. There is a set M of m items. 

3. Each player $i$ has a valuation $v_i : 2^M \to \mathbb{R}_+$. The number $v_i(T)$ indicates $i$’s value,
or willingness to pay, for the items $T \subseteq M$. The valuation is the private input of
player $i$: $i$ knows $v_i$ but none of the other $v_j$’s. We assume that $v_i(\emptyset) = 0$ and that
the valuations are monotone, meaning $v_i(S) \le v_i(T)$ whenever $S \subseteq T$. To avoid bit
complexity issues, we’ll also assume that all of the $v_i(T)$’s are integers with description
length polynomial in $k$ and $m$.

Note that we may have more than two players — more than just Alice and Bob. Also note 
that the description length of a player’s valuation is exponential in the number of items m. 

In the welfare-maximization problem, the goal is to partition the items $M$ into sets
$T_1, \ldots, T_k$ to maximize, at least approximately, the welfare

\[ \sum_{i=1}^{k} v_i(T_i), \]

using communication polynomial in $k$ and $m$. Note this amount of communication is
logarithmic in the sizes of the private inputs.

The main motivation for this problem is combinatorial auctions. Already in the domain 
of government spectrum auctions, dozens of such auctions have raised hundreds of billions 
of dollars of revenue. They have also been used for other applications such as allocating 
take-off and landing slots at airports. For example, items could represent licenses for wireless 
spectrum — the right to use a certain frequency range in a certain geographic area. Players 
would then be wireless telecommunication companies. The value $v_i(S)$ would be the amount
of profit company $i$ expects to be able to extract from the licenses in $S$.

Designing good combinatorial auctions requires careful attention to “incentive issues,” 
making the auctions as robust as possible to strategic behavior by the (self-interested) 
participants. Incentives won’t play much of a role in this lecture. Our lower bounds for 
protocols in Section 7.4 apply even in the ideal case where players are fully cooperative. 
Our lower bounds for equilibria in Section 7.5 effectively apply no matter how incentive
issues are resolved.


7.3 Multi-Party Communication Complexity 
7.3.1 The Model 

Welfare-maximization problems have an arbitrary number $k$ of players, so lower bounds for
them follow most naturally from lower bounds for multi-party communication protocols. The
extension from two to many parties proceeds as one would expect, so we’ll breeze through
the relevant points without much fuss.

Suppose we want to compute a Boolean function $f : \{0,1\}^{n_1} \times \{0,1\}^{n_2} \times \cdots \times \{0,1\}^{n_k} \to \{0,1\}$
that depends on the $k$ inputs $x_1, \ldots, x_k$. We’ll be interested in the number-in-hand
(NIH) model, where player $i$ only knows $x_i$. What other model could there be, you ask?
There’s also the stronger number-on-forehead (NOF) model, where player $i$ knows everything
except $x_i$. (Hence the name: imagine the players are sitting in a circle.) The NOF model
is studied mostly for its connections to circuit complexity; it has few direct algorithmic
applications, so we won’t discuss it in this course. The NIH model is the natural one for our
purposes and, happily, it’s also much easier to prove strong lower bounds for it.

Deterministic protocols are defined as you would expect, with the protocol specifying
whose turn it is to speak (as a function of the protocol’s transcript-so-far) and when the
computation is complete. We’ll use the blackboard model, where we think of the bits sent by
each player as being written on a blackboard in public view. [1] Similarly, in a nondeterministic
protocol, the prover writes a proof on the blackboard, and the protocol accepts the input if
and only if all $k$ players accept the proof.

[1] In the weaker message-passing model, players communicate by point-to-point messages rather than via
broadcast.

7.3.2 The Multi-Disjointness Problem


We need a problem that is hard for multi-party communication protocols. An obvious idea is
to use an analog of Disjointness. There is some ambiguity about how to define a version
of Disjointness for three or more players. For example, suppose there are three players,
and amongst the three possible pairings of them, two have disjoint sets while the third
have intersecting sets. Should this count as a “yes” or “no” instance? We’ll skirt this issue
by worrying only about unambiguous inputs, that are either “totally disjoint” or “totally
intersecting.”

Formally, in the Multi-Disjointness problem, each of the $k$ players $i$ holds an input
$x_i \in \{0,1\}^n$. (Equivalently, a set $S_i \subseteq \{1, 2, \ldots, n\}$.) The task is to correctly identify inputs
that fall into one of the following two cases:

(1) “Totally disjoint,” with $S_i \cap S_{i'} = \emptyset$ for every $i \ne i'$.

(0) “Totally intersecting,” with $\bigcap_{i=1}^{k} S_i \ne \emptyset$.

When $k = 2$, this is just Disjointness. When $k > 2$, there are inputs that are neither
1-inputs nor 0-inputs. We let protocols off the hook on such ambiguous inputs: they can
answer “1” or “0” with impunity.
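The promise structure of the problem can be captured in a few lines. This is a minimal sketch; the function name `classify` and the use of the string "*" for ambiguous inputs are illustrative choices.

```python
from itertools import combinations
from typing import List, Set


def classify(sets: List[Set[int]]) -> str:
    """Return "1" for totally disjoint inputs, "0" for totally intersecting
    inputs, and "*" for ambiguous inputs on which any answer is allowed."""
    if all(a.isdisjoint(b) for a, b in combinations(sets, 2)):
        return "1"
    if set.intersection(*sets):
        return "0"
    return "*"


print(classify([{1}, {2}, {3}]))        # pairwise disjoint
print(classify([{1, 2}, {2, 3}, {2}]))  # element 2 is in every set
print(classify([{1, 2}, {2, 3}, {4}]))  # neither case holds
```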

In the next section, we’ll prove the following communication complexity lower bound
for Multi-Disjointness, credited to Jaikumar Radhakrishnan and Venkatesh Srinivasan
in Nisan (2002).

Theorem 7.1 The nondeterministic communication complexity of Multi-Disjointness,
with $k$ players with $n$-bit inputs, is $\Omega(n/k)$.

The nondeterministic lower bound is for verifying a 1-input. (It is easy to verify a 0-input:
the prover just suggests the index of an element $r$ in $\bigcap_{i=1}^{k} S_i$, the validity of which is easily
checked privately by each of the players.)

In our application in Section 7.4, we’ll be interested in the case where $k$ is much smaller
than $n$, such as $k = \Theta(\log n)$. Intuition might suggest that the lower bound should be
$\Omega(n)$ rather than $\Omega(n/k)$, but this is incorrect: a slightly non-trivial argument shows that
Theorem 7.1 is tight for nondeterministic protocols (for all small enough $k$, like $k = O(\sqrt{n})$).
See the Homework for details. This factor-$k$ difference won’t matter for our applications,
however.


7.3.3 Proof of Theorem 7.1

The proof of Theorem 7.1 has three steps, all of which are generalizations of familiar
arguments.

Step 1: Every deterministic protocol with communication cost $c$ induces a partition of $M(f)$
into at most $2^c$ monochromatic boxes. By “$M(f)$,” we mean the $k$-dimensional array in
which the $i$th dimension is indexed by the possible inputs of player $i$, and an array entry
contains the value of the function $f$ on the corresponding joint input. By a “box,” we mean
the $k$-dimensional generalization of a rectangle: a subset of inputs that can be written as
a product $A_1 \times A_2 \times \cdots \times A_k$. By “monochromatic,” we mean a box that does not contain
both a 1-input and a 0-input. (Recall that for the Multi-Disjointness problem there are
also wildcard (“*”) inputs; a monochromatic box can contain any number of these.)

The proof of this step is the same as in the two-party case. We just run the protocol and
keep track of the joint inputs that are consistent with the transcript. The box of all inputs is
consistent with the empty transcript, and the box structure is preserved inductively: when
player $i$ speaks, it narrows down the remaining possibilities for the input $x_i$, but has no
effect on the possible values of the other inputs. Thus every transcript corresponds to a box,
with these boxes partitioning $M(f)$. Since the protocol’s output is constant over such a box
and the protocol computes $f$, all of the boxes it induces are monochromatic with respect to
$M(f)$.

Similarly, every nondeterministic protocol with communication cost $c$ (for verifying
1-inputs) induces a cover of the 1-inputs of $M(f)$ by at most $2^c$ monochromatic boxes.


Step 2: The number of 1-inputs in $M(f)$ is $(k+1)^n$. This step and the next are easy
generalizations of our second proof of our nondeterministic communication complexity lower
bounds for Disjointness (from Section 5.4.5): first we lower bound the number of 1-inputs,
then we upper bound the number of 1-inputs that can coexist in a single 1-box. In a 1-input
$(x_1, \ldots, x_k)$, for every coordinate $\ell$, at most one of the $k$ inputs has a 1 in the $\ell$th coordinate.
This yields $k+1$ options for each of the $n$ coordinates, thereby generating a total of $(k+1)^n$
1-inputs.
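For small parameters, the count in Step 2 can be confirmed by brute force. The sketch below enumerates all joint inputs, so it is only feasible for tiny values of k and n; the function name is illustrative.

```python
from itertools import product


def count_one_inputs(k: int, n: int) -> int:
    """Count joint inputs (x_1, ..., x_k), each x_i in {0,1}^n, in which every
    coordinate holds a 1 for at most one player (the totally disjoint inputs)."""
    all_inputs = list(product([0, 1], repeat=n))
    count = 0
    for joint in product(all_inputs, repeat=k):
        if all(1 >= sum(x[c] for x in joint) for c in range(n)):
            count += 1
    return count


for k, n in [(2, 4), (3, 3)]:
    print(k, n, count_one_inputs(k, n), (k + 1) ** n)
```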


Step 3: The number of 1-inputs in a monochromatic box is at most $k^n$. Let $B = A_1 \times
A_2 \times \cdots \times A_k$ be a 1-box. The key claim here is: for each coordinate $\ell = 1, \ldots, n$, there is a
player $i \in \{1, \ldots, k\}$ such that, for every input $x_i \in A_i$, the $\ell$th coordinate of $x_i$ is 0. That
is, to each coordinate we can associate an “ineligible player” that, in this box, never has a 1
in that coordinate. This is easily seen by contradiction: otherwise, there exists a coordinate
$\ell$ such that, for every player $i$, there is an input $x_i \in A_i$ with a 1 in the $\ell$th coordinate. As a
box, this means that $B$ contains the input $(x_1, \ldots, x_k)$. But this is a 0-input, contradicting
the assumption that $B$ is a 1-box.

The claim implies the stated upper bound. Every 1-input of $B$ can be generated by
choosing, for each coordinate $\ell$, an assignment of at most one “1” in this coordinate to one
of the $k - 1$ eligible players for this coordinate. With only $k$ choices per coordinate, there
are at most $k^n$ 1-inputs in the box $B$.


Conclusion: Steps 2 and 3 imply that covering the 1s of the $k$-dimensional array of
the Multi-Disjointness function requires at least $(1 + \frac{1}{k})^n$ 1-boxes. By the discussion
in Step 1, this implies a lower bound of $n \log_2(1 + \frac{1}{k}) = \Theta(n/k)$ on the nondeterministic
communication complexity of the Multi-Disjointness function (and output 1). This
concludes the proof of Theorem 7.1.


Remark 7.2 (Randomized Communication Complexity of Multi-Disjointness)
Randomized protocols with two-sided error also require communication $\Omega(n/k)$ to solve
Multi-Disjointness (Gronemeier 2009; Chakrabarti et al. 2003). [2] This generalizes the
$\Omega(n)$ lower bound that we stated (but did not prove) in Theorem 4.11, so naturally we’re
not going to prove this lower bound either. Extending the lower bound for Disjointness
to Multi-Disjointness requires significant work, but it is a smaller step than proving
from scratch a linear lower bound for Disjointness (Kalyanasundaram and Schnitger 1992;
Razborov 1992). This is especially true if one settles for the weaker lower bound of
$\Omega(n/k^4)$ (Alon et al. 1999), which is good enough for our purposes in this lecture.


7.4 Lower Bounds for Approximate Welfare Maximization 
7.4.1 General Valuations 

We now put Theorem 7.1 to work and prove that it is impossible to obtain a non-trivial
approximation of the general welfare-maximization problem with a subexponential (in $m$)
amount of communication. First, we observe that a $k$-approximation is trivial. The protocol
is to give the full set of items $M$ to the player with the largest $v_i(M)$. This protocol
can clearly be implemented with a polynomial amount of communication. To prove the
approximation guarantee, consider a partition $T_1, \ldots, T_k$ of $M$ with the maximum-possible
welfare $W^*$. There is a player $i$ with $v_i(T_i) \ge W^*/k$. The welfare obtained by our simple
protocol is at least $v_i(M)$; since we assume that valuations are monotone, this is at least
$v_i(T_i) \ge W^*/k$.
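The trivial $k$-approximation can be compared against the true optimum on a toy instance. This is a minimal sketch: valuations are modeled as Python functions on frozensets of item indices, the brute-force optimum enumerates all $k^m$ allocations and so only works for tiny inputs, and the example valuations (each player counts its hypothetical "favorite" items) are made up for illustration.

```python
from itertools import product
from typing import Callable, FrozenSet, Sequence

Valuation = Callable[[FrozenSet[int]], int]


def give_everything_to_best(valuations: Sequence[Valuation], num_items: int) -> int:
    """Welfare of the trivial protocol: all items go to the player who values
    the full set M the most."""
    everything = frozenset(range(num_items))
    return max(v(everything) for v in valuations)


def optimal_welfare(valuations: Sequence[Valuation], num_items: int) -> int:
    """Maximum welfare over all assignments of items to players (brute force)."""
    k = len(valuations)
    best = 0
    for owners in product(range(k), repeat=num_items):
        bundles = [frozenset(j for j, o in enumerate(owners) if o == i)
                   for i in range(k)]
        best = max(best, sum(v(b) for v, b in zip(valuations, bundles)))
    return best


# Toy monotone valuations: a bundle is worth the number of favorites it contains.
favorites = [frozenset({0, 1}), frozenset({2}), frozenset({3})]
valuations = [lambda bundle, f=f: len(bundle.intersection(f)) for f in favorites]

simple = give_everything_to_best(valuations, 4)
opt = optimal_welfare(valuations, 4)
print(simple, opt, len(valuations) * simple >= opt)
```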

To apply communication complexity, it is convenient to turn the optimization problem of
welfare maximization into a decision problem. In the Welfare-Maximization($k$) problem,
the goal is to correctly identify inputs that fall into one of the following two cases:

(1) Every partition $(T_1, \ldots, T_k)$ of the items has welfare at most 1.

(0) There exists a partition $(T_1, \ldots, T_k)$ of the items with welfare at least $k$.

Clearly, communication lower bounds for Welfare-Maximization($k$) apply more generally
to the problem of obtaining a better-than-$k$-approximation of the maximum welfare.

We prove the following. 


Theorem 7.3 (Nisan 2002) The communication complexity of Welfare-Maximization($k$)
is $\exp\{\Omega(m/k^2)\}$.

[2] (Footnote to Remark 7.2.) There is also a far-from-obvious matching upper bound of $O(n/k)$ for
Multi-Disjointness (Håstad and Wigderson 2007; Chakrabarti et al. 2003).

Thus, if the number of items $m$ is at least $k^{2+\varepsilon}$ for some $\varepsilon > 0$, then the communication
complexity of the Welfare-Maximization($k$) problem is exponential. Because the proof is
a reduction from Multi-Disjointness, the lower bound applies to deterministic protocols,
nondeterministic protocols (for the output 1), and randomized protocols with two-sided
error.


The proof of Theorem 7.3 relies on Theorem 7.1 and a combinatorial gadget. We
construct this gadget using the probabilistic method. As a thought experiment, consider $t$
random partitions $P^1, \ldots, P^t$ of $M$, where $t$ is a parameter to be defined later. By a random
partition $P^j = (P^j_1, \ldots, P^j_k)$, we just mean that each of the $m$ items is assigned to exactly
one of the $k$ players, independently and uniformly at random.

We are interested in the probability that two classes of different partitions intersect: for
all $i \ne i'$ and $j \ne \ell$, since the probability that a given item is assigned to $i$ in $P^j$ and also to
$i'$ in $P^\ell$ is $\frac{1}{k^2}$, we have

\[ \Pr\left[ P^j_i \cap P^\ell_{i'} = \emptyset \right] = \left( 1 - \frac{1}{k^2} \right)^m \le e^{-m/k^2}. \]

Taking a Union Bound over the $k$ choices for $i$ and $i'$ and the $t$ choices for $j$ and $\ell$, we have

\[ \Pr\left[ \exists\, i \ne i',\ j \ne \ell \ \text{s.t.}\ P^j_i \cap P^\ell_{i'} = \emptyset \right] \le k^2 t^2 e^{-m/k^2}. \tag{7.1} \]

Call $P^1, \ldots, P^t$ an intersecting family if $P^j_i \cap P^\ell_{i'} \ne \emptyset$ whenever $i \ne i'$, $j \ne \ell$. By (7.1), the
probability that our random experiment fails to produce an intersecting family is less than 1
provided $t < \frac{1}{k} e^{m/2k^2}$. The following lemma is immediate.

Lemma 7.4 For every $m, k \ge 1$, there exists an intersecting family of partitions $P^1, \ldots, P^t$
with $t = \exp\{\Omega(m/k^2)\}$.
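The probabilistic construction can be run directly for small parameters. The sketch below samples random partitions and retries until the family is intersecting; the retry cap `max_tries` and the function names are illustrative assumptions, and the parameters in the example (m = 40, k = 2, t = 3) are chosen so that the bound (7.1) makes failure very unlikely.

```python
import random
from typing import List, Set

Partition = List[Set[int]]  # Partition[i] = class of items assigned to player i


def random_partition(m: int, k: int, rng: random.Random) -> Partition:
    """Assign each of the m items to one of k players, independently and uniformly."""
    classes: Partition = [set() for _ in range(k)]
    for item in range(m):
        classes[rng.randrange(k)].add(item)
    return classes


def is_intersecting_family(family: List[Partition]) -> bool:
    """Check that classes of distinct players in distinct partitions always overlap."""
    for j, first in enumerate(family):
        for l, second in enumerate(family):
            if j == l:
                continue
            for i, class_i in enumerate(first):
                for i2, class_i2 in enumerate(second):
                    if i != i2 and class_i.isdisjoint(class_i2):
                        return False
    return True


def sample_intersecting_family(m: int, k: int, t: int, rng: random.Random,
                               max_tries: int = 1000) -> List[Partition]:
    """Repeat the random experiment until it yields an intersecting family."""
    for _ in range(max_tries):
        family = [random_partition(m, k, rng) for _ in range(t)]
        if is_intersecting_family(family):
            return family
    raise RuntimeError("no intersecting family found; try a larger m or smaller t")


family = sample_intersecting_family(m=40, k=2, t=3, rng=random.Random(0))
print(len(family), is_intersecting_family(family))
```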


A simple combination of Theorem 7.1 and Lemma 7.4 proves Theorem 7.3.


Proof of Theorem 7.3: The proof is a reduction from Multi-Disjointness. Fix $k$ and
$m$. (To be interesting, $m$ should be significantly bigger than $k^2$.) Let $(S_1, \ldots, S_k)$ denote
an input to Multi-Disjointness with $t$-bit inputs, where $t = \exp\{\Omega(m/k^2)\}$ is the same
value as in Lemma 7.4. We can assume that the players have coordinated in advance on an
intersecting family of $t$ partitions of a set $M$ of $m$ items. Each player $i$ uses this family and
its input $S_i$ to form the following valuation:

\[ v_i(T) = \begin{cases} 1 & \text{if } T \supseteq P^j_i \text{ for some } j \in S_i \\ 0 & \text{otherwise.} \end{cases} \]

That is, player $i$ is either happy (value 1) or unhappy (value 0), and is happy if and only if
it receives all of the items in the corresponding class $P^j_i$ of some partition $P^j$ with index $j$
belonging to its input to Multi-Disjointness. The valuations $v_1, \ldots, v_k$ define an input
to Welfare-Maximization($k$). Forming this input requires no communication between
the players.

Consider the case where the input to Multi-Disjointness is a 1-input, with $S_i \cap S_{i'} = \emptyset$
for every $i \ne i'$. We claim that the induced input to Welfare-Maximization($k$) is a
1-input, with maximum welfare at most 1. To see this, consider a partition $(T_1, \ldots, T_k)$ in
which some player $i$ is happy (with $v_i(T_i) = 1$). For some $j \in S_i$, player $i$ receives all the
items in $P^j_i$. Since $j \notin S_{i'}$ for every $i' \ne i$, the only way to make a second player $i'$ happy
is to give it all the items in $P^\ell_{i'}$ in some other partition $P^\ell$ with $\ell \in S_{i'}$ (and hence $\ell \ne j$).
Since $P^1, \ldots, P^t$ is an intersecting family, this is impossible: $P^j_i$ and $P^\ell_{i'}$ overlap for every
$\ell \ne j$.

When the input to Multi-Disjointness is a 0-input, with an element $r$ in the mutual
intersection $\bigcap_{i=1}^{k} S_i$, we claim that the induced input to Welfare-Maximization($k$) is a
0-input, with maximum welfare at least $k$. This is easy to see: for $i = 1, 2, \ldots, k$, assign the
items of $P^r_i$ to player $i$. Since $r \in S_i$ for every $i$, this makes all $k$ players happy.

This reduction shows that a (deterministic, nondeterministic, or randomized) protocol for
Welfare-Maximization($k$) yields one for Multi-Disjointness (with $t$-bit inputs) with
the same communication. We conclude that the communication complexity of
Welfare-Maximization($k$) is $\Omega(t/k) = \exp\{\Omega(m/k^2)\}$. ■
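The reduction can be exercised end to end on a tiny hand-built instance. In this sketch the intersecting family (two partitions of four items among two players) is chosen by hand rather than sampled, players and partitions are indexed from 0, and the welfare is computed by brute force; all names are illustrative.

```python
from itertools import product
from typing import Callable, FrozenSet, List, Sequence, Set

Partition = List[Set[int]]
Valuation = Callable[[FrozenSet[int]], int]


def make_valuation(player: int, input_set: Set[int],
                   family: List[Partition]) -> Valuation:
    """Player is happy (value 1) iff the bundle contains its class in some
    partition whose index belongs to its Multi-Disjointness input."""
    def valuation(bundle: FrozenSet[int]) -> int:
        return 1 if any(family[j][player].issubset(bundle) for j in input_set) else 0
    return valuation


def max_welfare(valuations: Sequence[Valuation], num_items: int) -> int:
    """Maximum welfare over all assignments of items to players (brute force)."""
    k = len(valuations)
    best = 0
    for owners in product(range(k), repeat=num_items):
        bundles = [frozenset(j for j, o in enumerate(owners) if o == i)
                   for i in range(k)]
        best = max(best, sum(v(b) for v, b in zip(valuations, bundles)))
    return best


# Hand-built intersecting family for m = 4 items and k = 2 players.
family = [
    [{0, 1}, {2, 3}],  # partition 0: class of player 0, class of player 1
    [{0, 2}, {1, 3}],  # partition 1
]


def induced_max_welfare(inputs: List[Set[int]]) -> int:
    valuations = [make_valuation(i, s, family) for i, s in enumerate(inputs)]
    return max_welfare(valuations, 4)


print(induced_max_welfare([{0}, {1}]))     # totally disjoint inputs
print(induced_max_welfare([{0, 1}, {0}]))  # index 0 is in every input
```

On the disjoint input both players would need item 1, so at most one can be happy; on the intersecting input, partition 0 makes both players happy at once.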


7.4.2 Subadditive Valuations 


To an algorithms person, Theorem 7.3 is depressing, as it rules out any non-trivial positive
results. A natural idea is to seek positive results by imposing additional structure on
players’ valuations. Many such restrictions have been studied. We consider here the case
of subadditive valuations, where each $v_i$ satisfies $v_i(S \cup T) \le v_i(S) + v_i(T)$ for every pair
$S, T \subseteq M$.

Our reduction in Theorem 7.3 immediately yields a weaker inapproximability result
for welfare maximization with subadditive valuations. Formally, define the
Welfare-Maximization(2) problem as that of identifying inputs that fall into one of the following
two cases:

(1) Every partition $(T_1, \ldots, T_k)$ of the items has welfare at most $k + 1$.

(0) There exists a partition $(T_1, \ldots, T_k)$ of the items with welfare at least $2k$.

Communication lower bounds for Welfare-Maximization(2) apply to the problem of
obtaining a better-than-2-approximation of the maximum welfare.


Corollary 7.5 (Dobzinski et al. 2010) The communication complexity of
Welfare-Maximization(2) is $\exp\{\Omega(m/k^2)\}$, even when all players have subadditive valuations.

Proof: Picking up where the reduction in the proof of Theorem 7.3 left off, every player $i$
adds 1 to its valuation for every non-empty set of items. Thus, the previously 0-1 valuations
become 0-1-2 valuations that are only 0 for the empty set. Such functions always
satisfy the subadditivity condition ($v_i(S \cup T) \le v_i(S) + v_i(T)$). 1-inputs and 0-inputs of
Multi-Disjointness now become 1-inputs and 0-inputs of Welfare-Maximization(2),
respectively. The communication complexity lower bound follows. ■

There is also a quite non-trivial matching upper bound of 2 for deterministic,
polynomial-communication protocols (Feige 2009).
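The subadditivity claim in the proof can be checked exhaustively on a small example. The sketch below takes one 0-1 valuation of the kind used in the reduction (the player is happy iff it gets items 0 and 1, an arbitrary choice for illustration), shows that it is not subadditive, and confirms that adding 1 on every non-empty bundle repairs this.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List

Valuation = Callable[[FrozenSet[int]], int]


def all_bundles(num_items: int) -> List[FrozenSet[int]]:
    """Every subset of the item set {0, ..., num_items - 1}."""
    items = range(num_items)
    return [frozenset(c) for r in range(num_items + 1)
            for c in combinations(items, r)]


def is_subadditive(valuation: Valuation, num_items: int) -> bool:
    """Check v(S | T) is at most v(S) + v(T) for every pair of bundles S, T."""
    bundles = all_bundles(num_items)
    return all(valuation(s) + valuation(t) >= valuation(s.union(t))
               for s in bundles for t in bundles)


def add_one_on_nonempty(valuation: Valuation) -> Valuation:
    """The modification from the proof: add 1 to the value of every non-empty bundle."""
    return lambda bundle: valuation(bundle) + (1 if bundle else 0)


def base(bundle: FrozenSet[int]) -> int:
    """A 0-1 valuation as in the reduction: happy iff the bundle contains {0, 1}."""
    return 1 if frozenset({0, 1}).issubset(bundle) else 0


print(is_subadditive(base, 3))                       # fails on S = {0}, T = {1}
print(is_subadditive(add_one_on_nonempty(base), 3))  # holds after the shift
```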


7.5 Lower Bounds for Equilibria 

The lower bounds of the previous section show that every protocol for the welfare-maximization
problem that interacts with the players and then explicitly computes an allocation has either
a bad approximation ratio or high communication cost. Over the past five years, many
researchers have aimed to shift the work from the protocol to the players, by analyzing the
equilibria of simple auctions. Can such equilibria bypass the communication complexity
lower bounds proved in Section 7.4? The answer is not obvious, because equilibria are
defined non-constructively, and not through a low-communication protocol. [3]


7.5.1 Game Theory 

Next we give the world’s briefest-ever game theory tutorial. See e.g. Shoham and
Leyton-Brown (2010), or the instructor’s CS364A lecture notes, for a more proper introduction.
We’ll be brief because the details of these concepts do not play a first-order role in the
arguments below.

Games

A (finite, normal-form) game is specified by:

1. A finite set of $k \ge 2$ players.

2. For each player $i$, a finite action set $A_i$.

3. For each player $i$, a utility function $u_i(\mathbf{a})$ that maps an action profile $\mathbf{a} \in A_1 \times \cdots \times A_k$
to a real number. The utility of a player generally depends not only on its action, but
also those chosen by the other players.

For example, in “Rock-Paper-Scissors (RPS),” there are two players, each with three actions.
A natural choice of utility functions is depicted in Figure 7.1.


[3] This question was bothering your instructor back in CS364B (Winter ’14); hence, Theorem 7.9.

              Rock      Paper     Scissors
  Rock        0,0       -1,1      1,-1
  Paper       1,-1      0,0       -1,1
  Scissors    -1,1      1,-1      0,0

Figure 7.1 Player utilities in Rock-Paper-Scissors. The pair of numbers in a matrix entry denote
the utilities of the row and column players, respectively, in a given outcome.


For a more complex and relevant example of a game, consider simultaneous first-price
auctions (S1As). There are $k$ players. An action $a_i$ of a player $i$ constitutes a bid $b_{ij}$ on
each item $j$ of a set $M$ of $m$ items. [4] In an S1A, each item is sold separately in parallel
using a “first-price auction”: the item is awarded to the highest bidder, and the price is
whatever that player bid. [5] To specify the utility functions, we assume that each player $i$
has a valuation $v_i$ as in Section 7.2. We define

\[ u_i(\mathbf{a}) = \underbrace{v_i(S_i)}_{\text{value of items won}} - \underbrace{\sum_{j \in S_i} b_{ij}}_{\text{price paid for them}}, \]

where $S_i$ denotes the items on which $i$ is the highest bidder (given the bids of $\mathbf{a}$). [6] Note
that the utility of a bidder depends both on its own action and those of the other bidders.
Having specified the players, their actions, and their utility functions, we see that an S1A is
an example of a game.
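The utility function of a simultaneous first-price auction is short enough to write out. In this sketch, bids are given as a matrix with one row per player, valuations are Python functions on frozensets of item indices, and ties go to the lowest-indexed bidder; that tie-breaking rule is one arbitrary but consistent choice, assumed here for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence

Valuation = Callable[[FrozenSet[int]], int]


def first_price_utilities(bids: Sequence[Sequence[int]],
                          valuations: Sequence[Valuation]) -> List[int]:
    """bids[i][j] is player i's bid on item j. Each item goes to its highest
    bidder, who pays that bid. Returns one utility per player."""
    k, m = len(bids), len(bids[0])
    won = [set() for _ in range(k)]
    for j in range(m):
        # max() returns the first maximizer, so ties go to the lowest index.
        winner = max(range(k), key=lambda i: bids[i][j])
        won[winner].add(j)
    return [valuations[i](frozenset(won[i])) - sum(bids[i][j] for j in won[i])
            for i in range(k)]


def additive(bundle: FrozenSet[int]) -> int:
    """Toy valuation: every item is worth 4."""
    return 4 * len(bundle)


utilities = first_price_utilities([[3, 1], [2, 2]], [additive, additive])
print(utilities)
```

In the example, player 0 wins item 0 at price 3 and player 1 wins item 1 at price 2, so each utility depends on the other player's bids as well as its own.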


Equilibria 

Given a game, how should one reason about it? The standard approach is to define some
notion of “equilibrium” and then study the equilibrium outcomes. There are many useful
notions of equilibria (see e.g. the instructor’s CS364A notes); for simplicity, we’ll stick here
with the most common notion, (mixed) Nash equilibria.^7

^4 To keep the game finite, let’s agree that each bid has to be an integer between 0 and some known upper
bound $B$.

^5 You may have also heard of the Vickrey or second-price auction, where the winner does not pay their
own bid, but rather the highest bid by someone else (the second-highest overall). We’ll stick with SIAs for
simplicity, but similar results are known for simultaneous second-price auctions, as well.

^6 Break ties in an arbitrary but consistent way.

^7 For the auction settings we study, “Bayes-Nash equilibria” are more relevant. These generalize Nash
equilibria, so our lower bounds immediately apply to them.

A mixed strategy for a player $i$ is a probability distribution over its actions, for
example the uniform distribution over Rock/Paper/Scissors. A Nash equilibrium is a
collection $\sigma_1, \ldots, \sigma_k$ of mixed strategies, one per player, so that each player is performing
a “best response” to the others. To explain, adopt the perspective of player $i$. We think
of $i$ as knowing the mixed strategies $\sigma_{-i}$ used by the other $k - 1$ players (but not their
coin flips). Thus, player $i$ can compute the expected payoff of each action $a_i \in A_i$, where
the expectation assumes that the other $k - 1$ players randomly and independently select
actions from their mixed strategies. Every action that maximizes $i$'s expected utility is a
best response to $\sigma_{-i}$. Similarly, every probability distribution over best responses is again a
best response (and these exhaust the best responses). For example, in Rock-Paper-Scissors,
both players playing the uniform distribution yields a Nash equilibrium. (Every action of a
player has expected utility 0 w.r.t. the mixed strategy of the other player, so everything is a
best response.)
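The Rock-Paper-Scissors claim can be checked mechanically. The sketch below uses the utilities of Figure 7.1 and exact rational arithmetic; the names `ROW_UTILITY`, `expected_row_utilities`, `is_best_response`, and `is_nash` are introduced here for illustration only.

```python
from fractions import Fraction

# ROW_UTILITY[r][c] is the row player's utility when row plays r and column
# plays c (order: Rock, Paper, Scissors). The column player's utility is the
# negation, as in Figure 7.1.
ROW_UTILITY = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def expected_row_utilities(col_strategy):
    """Expected utility of each pure row action against a mixed column strategy."""
    return [
        sum(p * ROW_UTILITY[r][c] for c, p in enumerate(col_strategy))
        for r in range(3)
    ]


def is_best_response(own, other):
    """True if `own` puts positive probability only on utility-maximizing actions."""
    payoffs = expected_row_utilities(other)
    best = max(payoffs)
    return all(p == 0 or payoffs[a] == best for a, p in enumerate(own))


def is_nash(row, col):
    # The matrix is antisymmetric, so the column player's expected utilities
    # against `row` are given by the same computation with the roles swapped.
    return is_best_response(row, col) and is_best_response(col, row)


uniform = [Fraction(1, 3)] * 3
print(expected_row_utilities(uniform))  # every action has expected utility 0
print(is_nash(uniform, uniform))
```

By contrast, both players always playing Rock fails the check, since Paper is then a strictly better response.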

Nash proved the following. 


Theorem 7.6 (Nash 1950) In every finite game, there is at least one Nash equilibrium. 


Theorem 7.6 can be derived from, and is essentially equivalent to, Brouwer’s Fixed-Point
Theorem. Note that a game can have a large number of Nash equilibria: if you’re trying to
meet a friend in New York City, with actions equal to intersections, then every intersection
corresponds to a Nash equilibrium.

An $\epsilon$-Nash equilibrium is the relaxation of a Nash equilibrium in which no player can
increase its expected utility by more than $\epsilon$ by switching to a different strategy. Note that
the set of $\epsilon$-Nash equilibria is nondecreasing with $\epsilon$. Such approximate Nash equilibria seem
crucial to the lower bound in Theorem 7.9 below.


The Price of Anarchy 


So how good are the equilibria of various games, such as SIAs? To answer this question, we
use an analog of the approximation ratio, adapted for equilibria. Given a game (like an SIA)
and a nonnegative maximization objective function on the outcomes (like welfare), the price
of anarchy (POA) (Koutsoupias and Papadimitriou 1999) is defined as the ratio between
the objective function value of an optimal solution, and that of the worst equilibrium.^8
If the equilibrium involves randomization, as with mixed strategies, then we consider its
expected objective function value.

The POA of a game and a maximization objective function is always at least 1. It is
common to identify “good performance” of a system with strategic participants as having a
POA close to 1.^9
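In symbols, the definition can be restated as follows. This is a hedged restatement: $W$ (the objective function), $\Omega$ (the set of outcomes), and $\mathrm{Eq}$ (the set of equilibria under the chosen equilibrium notion) are notation introduced here for illustration and do not appear in the notes.

```latex
\[
\mathrm{POA} \;=\;
\frac{\max_{\omega \in \Omega} W(\omega)}
     {\min_{\sigma \in \mathrm{Eq}} \mathbf{E}_{\omega \sim \sigma}\left[ W(\omega) \right]}
\;\ge\; 1 .
\]
```

The ratio is at least 1 because the expected objective value of any equilibrium is an average of values $W(\omega)$, none of which exceeds the maximum.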

^8 Recall that games generally have multiple equilibria. Ideally, we’d like an approximation guarantee
that applies to all equilibria; this is the point of the POA.

^9 An important issue, outside the scope of these notes, is the plausibility of a system reaching an
equilibrium. A natural solution is to relax the notion of equilibrium enough so that it becomes “relatively
easy” to reach an equilibrium. See e.g. the instructor’s CS364A notes for much more on this point.

For example, the equilibria of SIAs are surprisingly good in fairly general settings.

Theorem 7.7 (Feldman et al. 2013) In every SIA with subadditive bidder valuations,
the POA is at most 2.

Theorem 7.7 is non-trivial and we won’t prove it here (see the paper or the instructor’s
CS364B notes for a proof). This result is particularly impressive because achieving an
approximation factor of 2 for the welfare-maximization problem with subadditive bidder
valuations by any means (other than brute-force search) is not easy (see Feige (2009)).


A recent result shows that the analysis of Feldman et al. (2013) is tight.

Theorem 7.8 (Christodoulou et al. 2013) The worst-case POA of SIAs with subadditive
bidder valuations is at least 2.

The proof of Theorem 7.8 is an ingenious explicit construction: the authors exhibit a
choice of subadditive bidder valuations and a Nash equilibrium of the corresponding SIA so
that the welfare of this equilibrium is only half of the maximum possible. One reason that
proving results like Theorem 7.8 is challenging is that it can be difficult to solve for a (bad)
equilibrium of a complex game like an SIA.


7.5.2 Price-of-Anarchy Lower Bounds from Communication Complexity 


Theorem 7.7 motivates an obvious question: can we do better? Theorem 7.8 implies that
the analysis in Feldman et al. (2013) cannot be improved, but can we reduce the POA
by considering a different auction? Ideally, the auction would still be “reasonably simple”
in some sense. Alternatively, perhaps no “simple” auction could be better than SIAs? If
this is the case, it’s not clear how to prove it directly: proving lower bounds via explicit
constructions auction-by-auction does not seem feasible.

Perhaps it’s a clue that the POA upper bound of 2 for SIAs (Theorem 7.7) gets stuck at
the same threshold for which there is a lower bound for protocols that use polynomial
communication (Theorem 7.5). It’s not clear, however, that a lower bound for low-communication
protocols has anything to do with equilibria. In the spirit of the other reductions that we’ve
seen in this course, can we extract a low-communication protocol from an equilibrium?


Theorem 7.9 (Roughgarden 2014) Fix a class $\mathcal{V}$ of possible bidder valuations. Suppose
there exists no nondeterministic protocol with subexponential (in $m$) communication for the
1-inputs of the following promise version of the welfare-maximization problem with bidder
valuations in $\mathcal{V}$:

(1) Every allocation has welfare at most $W^*/\alpha$.

(0) There exists an allocation with welfare at least $W^*$.

Let $\epsilon$ be bounded below by some inverse polynomial function of $k$ and $m$. Then, for every
auction with sub-doubly-exponential (in $m$) actions per player, the worst-case POA of $\epsilon$-Nash
equilibria with bidder valuations in $\mathcal{V}$ is at least $\alpha$.

Theorem 7.9 says that lower bounds for nondeterministic protocols carry over to all
“sufficiently simple” auctions, where “simplicity” is measured by the number of actions available
to each player. These POA lower bounds follow from communication complexity lower
bounds, and do not require any new explicit constructions.

To get a feel for the simplicity constraint, note that SIAs with integral bids between 0
and $B$ have $(B + 1)^m$ actions per player, which is singly exponential in $m$. On the other hand, in
a “direct-revelation” auction, where each bidder is allowed to submit a bid on each bundle
$S \subseteq M$ of items, each player has a doubly-exponential (in $m$) number of actions.^10

The POA lower bound promised by Theorem 7.9 is only for $\epsilon$-Nash equilibria; since
the POA is a worst-case measure and the set of $\epsilon$-Nash equilibria is nondecreasing with $\epsilon$,
this is weaker than a lower bound for exact Nash equilibria. It is an open question whether
or not Theorem 7.9 holds also for the POA of exact Nash equilibria. Arguably, Theorem 7.9
is good enough for all practical purposes: a POA upper bound that holds for exact Nash
equilibria and does not hold (at least approximately) for $\epsilon$-Nash equilibria with very small $\epsilon$
is too brittle to be meaningful.

Theorem 7.9 has a number of interesting corollaries. First, since SIAs have only a
singly-exponential (in $m$) number of actions per player, Theorem 7.9 applies to them. Thus,
combining it with Theorem 7.5 recovers the POA lower bound of Theorem 7.8 (modulo
the exact vs. approximate Nash equilibria issue) and shows the optimality of the upper
bound in Theorem 7.7 without an explicit construction. More interestingly, this POA lower
bound of 2 (for subadditive bidder valuations) applies not only to SIAs, but more generally
to all auctions in which each player has a sub-doubly-exponential number of actions. Thus,
SIAs are in fact optimal among the class of all such auctions when bidders have subadditive
valuations (w.r.t. the worst-case POA of $\epsilon$-Nash equilibria).

We can also combine Theorem 7.9 with Theorem 7.3 to prove that no “simple” auction
gives a non-trivial (better than $k$-) approximation for general bidder valuations. Thus with
general valuations, complexity is essential to any auction format that offers good equilibrium
guarantees.


7.5.3 Proof of Theorem 7.9 


Presumably, the proof of Theorem 7.9 extracts a low-communication protocol from a good
POA bound. The hypothesis of Theorem 7.9 offers the clue that we should be looking to
construct a nondeterministic protocol. So what could we use an all-powerful prover for?
We’ll see that a good role for the prover is to suggest a Nash equilibrium to the players.

Unfortunately, it’s too expensive for the prover to even write down the description
of a Nash equilibrium, even in SIAs. Recall that a mixed strategy is a distribution over
actions, and that each player has an exponential (in $m$) number of actions available in an
SIA. Specifying a Nash equilibrium thus requires an exponential number of probabilities.
To circumvent this issue, we resort to $\epsilon$-Nash equilibria, which are guaranteed to exist even
if we restrict ourselves to distributions with small descriptions.

^10 Equilibria can achieve the optimal welfare in direct-revelation mechanisms, so the bound in Theorem 7.9
on the number of actions is necessary. See the Exercises for further details.


Lemma 7.10 (Lipton et al. 2003) For every $\epsilon > 0$ and every game with $k$ players with
action sets $A_1, \ldots, A_k$, there exists an $\epsilon$-Nash equilibrium with description length polynomial
in $k$, $\log(\max_{i=1}^k |A_i|)$, and $1/\epsilon$.

We give the high-level idea of the proof of Lemma 7.10; see the Exercises for details.

1. Let $(\sigma_1, \ldots, \sigma_k)$ be a Nash equilibrium. (One exists, by Nash’s Theorem.)

2. Run $T$ independent trials of the following experiment: draw actions $a_1^t \sim \sigma_1, \ldots, a_k^t \sim \sigma_k$
for the $k$ players independently, according to their mixed strategies in the Nash
equilibrium.

3. For each $i$, define $\hat{\sigma}_i$ as the empirical distribution of the $a_i^t$’s. (With the probability
of $a_i$ in $\hat{\sigma}_i$ equal to the fraction of trials in which $i$ played $a_i$.)

4. Use Chernoff bounds to prove that, if $T$ is at least a sufficiently large polynomial
in $k$, $\log(\max_{i=1}^k |A_i|)$, and $1/\epsilon$, then with high probability $(\hat{\sigma}_1, \ldots, \hat{\sigma}_k)$ is an $\epsilon$-Nash
equilibrium. Note that the natural description length of $(\hat{\sigma}_1, \ldots, \hat{\sigma}_k)$, for example
just by listing all of the sampled actions, is polynomial in $k$, $\log(\max_{i=1}^k |A_i|)$, and $1/\epsilon$.

The intuition is that, for $T$ sufficiently large, expectations with respect to $\sigma_i$ and with respect
to $\hat{\sigma}_i$ should be roughly the same. Since there are $|A_i|$ relevant expectations per player (the
expected utility of each of its actions) and Chernoff bounds give deviation probabilities that
have an inverse exponential form, we might expect a $\log |A_i|$ dependence to show up in the
number of trials.
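The sampling idea in steps 1 to 4 can be tried out on Rock-Paper-Scissors, where the uniform strategies form an exact Nash equilibrium. The sketch below is only an illustration of the experiment, not a proof: the helper names (`empirical_strategy`, `regret`, `sampled_epsilon`), the fixed seed, and the choice of game are assumptions made here.

```python
import random
from collections import Counter

# Row player's utilities in Rock-Paper-Scissors; the game is antisymmetric,
# so the same matrix serves the column player with the roles swapped.
ROW_UTILITY = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def empirical_strategy(strategy, trials, rng):
    """Empirical distribution of `trials` independent draws from `strategy`."""
    draws = rng.choices(range(len(strategy)), weights=strategy, k=trials)
    counts = Counter(draws)
    return [counts[a] / trials for a in range(len(strategy))]


def regret(own, other):
    """Largest expected gain available by deviating from `own` against `other`."""
    payoffs = [
        sum(q * ROW_UTILITY[a][c] for c, q in enumerate(other)) for a in range(3)
    ]
    current = sum(p * u for p, u in zip(own, payoffs))
    return max(payoffs) - current


def sampled_epsilon(trials, seed=0):
    """Smallest eps for which the two empirical strategies form an eps-Nash equilibrium."""
    rng = random.Random(seed)
    uniform = [1 / 3, 1 / 3, 1 / 3]
    row_hat = empirical_strategy(uniform, trials, rng)
    col_hat = empirical_strategy(uniform, trials, rng)
    return max(regret(row_hat, col_hat), regret(col_hat, row_hat))


for trials in (100, 10_000):
    print(trials, sampled_epsilon(trials))
```

Each empirical strategy can be described by listing its `trials` sampled actions, which is the "small description" that the lemma exploits; the printed regret typically shrinks as the number of trials grows.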


We now proceed to the proof of Theorem 7.9.

Proof of Theorem 7.9: Fix an auction with at most $A$ actions per player, and a value for
$\epsilon = \Omega(1/\mathrm{poly}(k, m))$. Assume that, no matter what the bidder valuations $v_1, \ldots, v_k \in \mathcal{V}$
are, the POA of $\epsilon$-Nash equilibria of the auction is at most $\rho < \alpha$. We will show that $A$
must be doubly-exponential in $m$.

Consider the following nondeterministic protocol for computing a 1-input of the
welfare-maximization problem, that is, for convincing the $k$ players that every allocation has welfare
at most $W^*/\alpha$. See also Figure 7.2. The prover writes on a publicly visible blackboard
an $\epsilon$-Nash equilibrium $(\sigma_1, \ldots, \sigma_k)$ of the auction, with description length polynomial in $k$,
$\log A$, and $1/\epsilon = O(\mathrm{poly}(k, m))$, as guaranteed by Lemma 7.10. The prover also writes down
the expected welfare contribution $\mathbf{E}[v_i(S_i)]$ of each bidder $i$ in this equilibrium.

Given this advice, each player $i$ verifies that $\sigma_i$ is indeed an $\epsilon$-best response to the other
$\sigma_j$’s and that its expected welfare is as claimed when all players play the mixed strategies
$\sigma_1, \ldots, \sigma_k$. Crucially, player $i$ is fully equipped to perform both of these checks without any
communication: it knows its valuation $v_i$ (and hence its utility in each outcome of the
game) and the mixed strategies used by all players, and this is all that is needed to verify
the $\epsilon$-Nash equilibrium conditions that apply to it and to compute its expected contribution
to the welfare.^11 Player $i$ accepts if and only if the prover’s advice passes these two tests,
and if the expected welfare of the equilibrium is at most $W^*/\alpha$.

[Figure 7.2 diagram. Left branch: if $\mathbf{E}[\mathrm{welfare}(\mathbf{x})] \le W/\alpha$, then $\mathrm{OPT} \le \rho W/\alpha < W$ (so case (i)). Right branch: if $\mathbf{E}[\mathrm{welfare}(\mathbf{x})] > W/\alpha$, then $\mathrm{OPT} > W/\alpha$ (so case (ii)).]

Figure 7.2 Proof of Theorem 7.9. How to extract a low-communication nondeterministic protocol
from a good price-of-anarchy bound.

For the protocol correctness, consider first the case of a 1-input, where every allocation
has welfare at most $W^*/\alpha$. If the prover writes down the description of an arbitrary $\epsilon$-Nash
equilibrium and the appropriate expected contributions to the social welfare, then all of the
players will accept (the expected welfare is obviously at most $W^*/\alpha$). We also need to argue
that, for the case of a 0-input, where some allocation has welfare at least $W^*$, there
is no proof that causes all of the players to accept. We can assume that the prover writes
down an $\epsilon$-Nash equilibrium and its correct expected welfare $W$, since otherwise at least one
player will reject. Since the maximum-possible welfare is at least $W^*$ and (by assumption)
the POA of $\epsilon$-Nash equilibria is at most $\rho < \alpha$, the expected welfare of the given $\epsilon$-Nash
equilibrium must satisfy $W \ge W^*/\rho > W^*/\alpha$. Since the players will reject such a proof, we
conclude that the protocol is correct. Our assumption then implies that the protocol has
communication cost exponential in $m$. Since the cost of the protocol is polynomial in $k$, $m$,
and $\log A$, $A$ must be doubly exponential in $m$. ■

^11 These computations may take a super-polynomial amount of time, but they do not contribute to the
protocol’s cost.

Conceptually, the proof of Theorem 7.9 argues that, when the POA of $\epsilon$-Nash equilibria
is small, every $\epsilon$-Nash equilibrium provides a privately verifiable proof of a good upper bound
on the maximum-possible welfare. When such upper bounds require large communication,
the equilibrium description length (and hence the number of actions) must be large.
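The final counting step of the proof can be spelled out under concrete, illustrative assumptions. Suppose the communication lower bound has the form $2^{\gamma m}$ and the protocol's cost is at most $(k \, m \log_2 A)^c$; the constants $\gamma, c > 0$ and these specific functional forms are assumptions made here to show the shape of the calculation, not quantities from the notes.

```latex
\begin{align*}
(k \, m \log_2 A)^{c} \;\ge\; 2^{\gamma m}
  &\;\Longrightarrow\; k \, m \log_2 A \;\ge\; 2^{\gamma m / c} \\
  &\;\Longrightarrow\; \log_2 A \;\ge\; \frac{2^{\gamma m / c}}{k \, m} \\
  &\;\Longrightarrow\; A \;\ge\; 2^{\,2^{\gamma m / c} / (k m)} .
\end{align*}
```

When $k$ is at most polynomial in $m$, the exponent $2^{\gamma m / c} / (k m)$ is still exponential in $m$, so $A$ is doubly exponential in $m$.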


7.5.4 An Open Question 


While Theorems 7.5, 7.7, and 7.9 pin down the best-possible POA achievable by simple
auctions with subadditive bidder valuations, there are still open questions for other valuation
classes. For example, a valuation $v_i$ is submodular if it satisfies

$$v_i(T \cup \{j\}) - v_i(T) \le v_i(S \cup \{j\}) - v_i(S)$$

for every $S \subseteq T \subset M$ and $j \notin T$. This is a “diminishing returns” condition for set functions.
Every submodular function is also subadditive, so welfare-maximization with the former
valuations is only easier than with the latter.
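For small item sets, both conditions can be checked by brute force. The sketch below is illustrative: the function names and the representation of a valuation as a callable on frozensets are assumptions made here, and the running time is exponential in the number of items.

```python
from itertools import combinations


def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = list(items)
    return [
        frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)
    ]


def is_submodular(v, items):
    """Check v(T + j) - v(T) <= v(S + j) - v(S) for all S subset of T and j not in T."""
    all_sets = subsets(items)
    for S in all_sets:
        for T in all_sets:
            if not S <= T:
                continue
            for j in items:
                if j in T:
                    continue
                if v(T | {j}) - v(T) > v(S | {j}) - v(S):
                    return False
    return True


def is_subadditive(v, items):
    """Check v(S union T) <= v(S) + v(T) for all pairs of bundles."""
    all_sets = subsets(items)
    return all(v(S | T) <= v(S) + v(T) for S in all_sets for T in all_sets)
```

As a usage example on three items, the valuation with values 0, 1, 1, 2 for bundles of size 0, 1, 2, 3 is subadditive but not submodular, because the third item's marginal value (1) exceeds the second item's (0).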

The worst-case POA of SIAs is exactly $\frac{e}{e-1} \approx 1.58$ when bidders have submodular
valuations. The upper bound was proved in Syrgkanis and Tardos (2013), the lower bound
in Christodoulou et al. (2013). It is an open question whether or not there is a simple
auction with a smaller worst-case POA. The best lower bound known (for nondeterministic
protocols and hence, by Theorem 7.9, for the POA of $\epsilon$-Nash equilibria of simple auctions) is
$\frac{2e}{2e-1} \approx 1.23$. Intriguingly, there is an upper bound (slightly) better than $\frac{e}{e-1}$ for
polynomial-communication protocols (Feige and Vondrak 2010); can this better upper bound also be
realized as the POA of a simple auction? What is the best-possible approximation guarantee,
either for polynomial-communication protocols or for the POA of simple auctions?


8.1 Property Testing


We begin in this section with a brief introduction to the field of property testing. Section 8.2
explains the famous example of “linearity testing,” Section 8.3 gives upper bounds for the
canonical problem of “monotonicity testing,” and Section 8.4 shows how to derive property
testing lower bounds from communication complexity lower bounds.^1 These lower bounds
will follow from our existing communication complexity toolbox (specifically, Disjointness);
no new results are required.

Let $D$ and $R$ be a finite domain and range, respectively. In this lecture, $D$ will always
be $\{0,1\}^n$, while $R$ might or might not be $\{0,1\}$. A property is simply a set $\mathcal{P}$ of functions
from $D$ to $R$. Examples we have in mind include:

1. Linearity, where $\mathcal{P}$ is the set of linear functions (with $R$ a field and $D$ a vector space
over $R$).

2. Monotonicity, where $\mathcal{P}$ is the set of monotone functions (with $D$ and $R$ being partially
ordered sets).

3. Various graph properties, like bipartiteness (with functions corresponding to characteristic
vectors of edge sets, with respect to a fixed vertex set).

4. And so on. The property testing literature is vast. See Ron (2010) for a starting point.

In the standard property testing model, one has “black-box access” to a function $f : D \to R$.
That is, one can only learn about $f$ by supplying an argument $x \in D$ and receiving
the function’s output $f(x) \in R$. The goal is to test membership in $\mathcal{P}$ by querying $f$ as few
times as possible. Since the goal is to use a small number of queries (much smaller than
$|D|$), there is no hope of testing membership exactly. For example, suppose you derive $f$
from your favorite monotone function by changing its value at a single point to introduce a
non-monotonicity. There is little hope of detecting this monotonicity violation with a small
number of queries. We therefore consider a relaxed “promise” version of the membership
problem.


^1 Somewhat amazingly, this connection was only discovered in 2011 (Blais et al. 2012), even though the
connection is simple and property testing is a relatively mature field.

Formally, we say that a function $f$ is $\epsilon$-far from the property $\mathcal{P}$ if, for every $g \in \mathcal{P}$, $f$ and
$g$ differ in at least $\epsilon|D|$ entries. Viewing functions as vectors indexed by $D$ with coordinates
in $R$, this definition says that $f$ has distance at least $\epsilon|D|$ from its nearest neighbor in $\mathcal{P}$
(under the Hamming metric). Equivalently, repairing $f$ so that it belongs to $\mathcal{P}$ would require
changing at least an $\epsilon$ fraction of its values. A function $f$ is $\epsilon$-close to $\mathcal{P}$ if it is not $\epsilon$-far, that is,
if it can be turned into a function in $\mathcal{P}$ by modifying its values on strictly less than $\epsilon|D|$
entries.
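For tiny domains, the distance from a function to a property can be computed by brute force. The sketch below does this for linearity over $\{0,1\}^n$, assuming the standard characterization of linear functions over $\mathbb{F}_2$ as $x \mapsto \sum_i a_i x_i \bmod 2$; the function names and the truth-table encoding are choices made here for illustration.

```python
from itertools import product


def linear_functions(n):
    """Truth tables of x -> sum_i a_i x_i (mod 2), one per coefficient vector a."""
    points = list(product([0, 1], repeat=n))
    return [
        tuple(sum(a_i * x_i for a_i, x_i in zip(a, x)) % 2 for x in points)
        for a in product([0, 1], repeat=n)
    ]


def distance_to_property(table, prop):
    """Smallest fraction of entries of `table` that must change to land in `prop`."""
    return min(sum(u != v for u, v in zip(table, g)) for g in prop) / len(table)


n = 3
lin = linear_functions(n)
# The AND of three bits differs from the all-zero (linear) function at one point.
and_table = tuple(int(all(x)) for x in product([0, 1], repeat=n))
print(distance_to_property(and_table, lin))
```

A function is $\epsilon$-far from the property exactly when this distance is at least $\epsilon$. The enumeration takes time exponential in $n$, which is why a tester must rely on a small number of queries instead.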

The property testing goal is to query a function $f$ a small number of times and then
decide if:

1. $f \in \mathcal{P}$; or

2. $f$ is $\epsilon$-far from $\mathcal{P}$.

If neither of these two conditions applies to $f$, then the tester is off the hook: any
declaration is treated as correct.

A tester specifies a sequence of queries to the unknown function $f$, and a declaration of
either “$\in \mathcal{P}$” or “$\epsilon$-far from $\mathcal{P}$” at its conclusion. Interesting property testing results almost
always require randomization. Thus, we allow the tester to be randomized, and allow it to
err with probability at most 1/3. As with communication protocols, testers come in various
flavors. One-sided error means that functions in $\mathcal{P}$ are accepted with probability 1, with no
false negatives allowed. Testers with two-sided error are allowed both false positives and false
negatives (with probability at most 1/3, on every input that satisfies the promise). Testers
can be non-adaptive, meaning that they flip all their coins and specify all their queries up
front, or adaptive, with queries chosen as a function of the answers to previously asked
queries. For upper bounds, we prefer the weakest model of non-adaptive testers with one-sided
error. Often (though not always) in property testing, neither adaptivity nor two-sided
error leads to more efficient testers. Lower bounds can be much more difficult to prove for
adaptive testers with two-sided error, however.

For a given choice of a class of testers, the query complexity of a property $\mathcal{P}$ is the
minimum (over testers) worst-case (over inputs) number of queries used by a tester that
solves the testing problem for $\mathcal{P}$. The best-case scenario is that the query complexity of a
property is a function of $\epsilon$ only; sometimes it depends on the size of $D$ or $R$ as well.


8.2 Example: The BLR Linearity Test 


The unofficial beginning of the field of property testing is Blum et al. (1993). (For the
official beginning, see Rubinfeld and Sudan (1996) and Goldreich et al. (1998).) The setting
is $D = \{0,1\}^n$ and $R = \{0,1\}$, and the property is the set of linear functions, meaning
functions $f$ such that $f(x + y) = f(x) + f(y)$ (over $\mathbb{F}_2$) for all $x, y \in \{0,1\}^n$.^2 The BLR
linearity test is the following:

1. Repeat $t = \Theta(1/\epsilon)$ times:

a) Pick $x, y \in \{0,1\}^n$ uniformly at random.

b) If $f(x + y) \ne f(x) + f(y)$ (over $\mathbb{F}_2$), then REJECT.

2. ACCEPT.

It is clear that if $f$ is linear, then the BLR linearity test accepts it with probability 1.
That is, the test has one-sided error. The test is also non-adaptive: the $t$ random choices
of $x$ and $y$ can all be made up front. The non-trivial statement is that only functions that
are close to linear pass the test with large probability.
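The BLR linearity test described above can be sketched directly in Python. This is a minimal sketch: the function name `blr_test`, the encoding of inputs as bit tuples, and passing the number of trials and the random generator as parameters are choices made here; addition over $\mathbb{F}_2$ is implemented as XOR.

```python
import random


def blr_test(f, n, trials, rng):
    """Return True (ACCEPT) or False (REJECT) for a function f on n-bit tuples."""
    for _ in range(trials):
        x = tuple(rng.randrange(2) for _ in range(n))
        y = tuple(rng.randrange(2) for _ in range(n))
        x_plus_y = tuple(a ^ b for a, b in zip(x, y))  # coordinate-wise sum over F_2
        if f(x_plus_y) != f(x) ^ f(y):
            return False  # found a local violation of linearity
    return True


rng = random.Random(0)
parity = lambda x: sum(x) % 2   # linear, so it is always accepted
constant_one = lambda x: 1      # f(x + y) = 1 but f(x) + f(y) = 0 for all x, y
print(blr_test(parity, 5, 50, rng), blr_test(constant_one, 5, 1, rng))
```

All of the random pairs could equally be drawn before any query is made, which is the sense in which the test is non-adaptive.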


Theorem 8.1 (Blum et al. 1993) If the BLR linearity test accepts a function $f$ with
probability greater than 1/3, then $f$ is $\epsilon$-close to the set of linear functions.

The modern and slick proof of Theorem 8.1 uses Fourier analysis; indeed, the elegance of
this proof serves as convincing motivation for the more general study of Boolean functions
from a Fourier-analytic perspective. See Chapter 1 of O'Donnell (2014) for a good exposition.
There are also more direct proofs of Theorem 8.1, as in Blum et al. (1993). None of these
proofs are overly long, but we’ll spend our time on monotonicity testing instead. We mention
the BLR test for the following reasons:


1. If you only remember one property testing result, Theorem 8.1 and the BLR linearity
test would be a good one.

2. The BLR test is the thin end of the wedge in constructions of probabilistically checkable
proofs (PCPs). Recall that a language is in NP if membership can be efficiently
verified; for example, verifying an alleged satisfying assignment to a SAT formula
is easy to do in polynomial time. The point of a PCP is to rewrite such a proof of
membership so that it can be probabilistically verified after reading only a constant
number of bits. The BLR test does exactly this for the special case of linearity testing,
that is, for proofs where “correctness” is equated with being the truth table of a linear function.
The BLR test effectively means that one can assume without loss of generality that a
proof encodes a linear function: the BLR test can be used as a preprocessing step to
reject alleged proofs that are not close to a linear function. Subsequent testing steps
can then focus on whether or not the encoded linear function is close to a subset of
linear functions of interest.

3. Theorem 8.1 highlights a consistent theme in property testing: establishing connections
between “global” and “local” properties of a function. Saying that a function $f$ is
$\epsilon$-far from a property $\mathcal{P}$ refers to the entire domain $D$ and in this sense asserts a “global
violation” of the property. Property testers work well when there are ubiquitous “local
violations” of the property. Theorem 8.1 proves that, for the property of linearity, a
global violation necessarily implies lots of local violations. We give a full proof of such
a “global to local” statement for monotonicity testing in the next section.

^2 Equivalently, these are the functions that can be written as $f(x) = \sum_{i=1}^n a_i x_i$ for some
$a_1, \ldots, a_n \in \{0, 1\}$.


8.3 Monotonicity Testing: Upper Bounds 


The problem of monotonicity testing was introduced in Goldreich et al. (2000) and is one
of the central problems in the field. We discuss the Boolean case, where there have been
several breakthroughs in just the past few months, in Sections 8.3.1 and 8.3.2. We discuss
the case of larger ranges, where communication complexity has been used to prove strong
lower bounds, in Section 8.3.3.


8.3.1 The Boolean Case 

In this section, we take $D = \{0,1\}^n$ and $R = \{0,1\}$. For $b \in \{0,1\}$ and $x_{-i} \in \{0,1\}^{n-1}$, we
use the notation $(b, x_{-i})$ to denote a vector of $\{0,1\}^n$ in which the $i$th bit is $b$ and the other
$n - 1$ bits are $x_{-i}$. A function $f : \{0,1\}^n \to \{0,1\}$ is monotone if flipping a coordinate of
an input from 0 to 1 can only increase the function’s output:

$$f(0, x_{-i}) \le f(1, x_{-i})$$

for every $i \in \{1, 2, \ldots, n\}$ and $x_{-i} \in \{0,1\}^{n-1}$.

It will be useful to visualize the domain $\{0,1\}^n$ as the $n$-dimensional hypercube; see also
Figure 8.1. This graph has $2^n$ vertices and $n 2^{n-1}$ edges. An edge can be uniquely specified
by a coordinate $i$ and vector $x_{-i} \in \{0,1\}^{n-1}$; the edge’s endpoints are then $(0, x_{-i})$ and
$(1, x_{-i})$. By the $i$th slice of the hypercube, we mean the $2^{n-1}$ edges for which the endpoints
differ (only) in the $i$th coordinate. The $n$ slices form a partition of the edge set of the
hypercube, and each slice is a perfect matching of the hypercube’s vertices. A function
$\{0,1\}^n \to \{0,1\}$ can be visualized as a binary labeling of the vertices of the hypercube.

We consider the following edge tester, which picks random edges of the hypercube and
rejects if it ever finds a monotonicity violation across one of the chosen edges.

1. Repeat $t$ times:

a) Pick $i \in \{1, 2, \ldots, n\}$ and $x_{-i} \in \{0,1\}^{n-1}$ uniformly at random.

b) If $f(0, x_{-i}) > f(1, x_{-i})$ then REJECT.

2. ACCEPT.

Figure 8.1 $\{0,1\}^n$ can be visualized as an $n$-dimensional hypercube.


Like the BLR test, it is clear that the edge tester has one-sided error (no false negatives)
and is non-adaptive. The non-trivial part is to understand the probability of rejecting a
function that is $\epsilon$-far from monotone: how many trials $t$ are necessary and sufficient for a
rejection probability of at least 2/3? Conceptually, how pervasive are the local failures of
monotonicity for a function that is $\epsilon$-far from monotone?

The bad news is that, in contrast to the BLR linearity test, taking $t$ to be a constant
(depending only on $\epsilon$) is not good enough. The good news is that we can take $t$ to be only
logarithmic in the size of the domain.
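The edge tester described above can be sketched as follows. This is a minimal sketch: the name `edge_tester`, the bit-tuple encoding of inputs, 0-based coordinates, and passing the number of trials and the random generator as parameters are choices made here.

```python
import random


def edge_tester(f, n, trials, rng):
    """Return True (ACCEPT) or False (REJECT) for f : {0,1}^n -> {0,1}."""
    for _ in range(trials):
        i = rng.randrange(n)                        # random slice (coordinate)
        x = [rng.randrange(2) for _ in range(n)]    # coordinate i is overwritten below
        x[i] = 0
        low = f(tuple(x))
        x[i] = 1
        high = f(tuple(x))
        if low > high:
            return False  # monotonicity violated across the sampled edge
    return True


rng = random.Random(0)
print(edge_tester(lambda x: int(any(x)), 4, 100, rng))  # OR is monotone: accepted
print(edge_tester(lambda x: 1 - x[0], 1, 1, rng))       # the only edge is violated
```

Drawing all $n$ bits uniformly and then overwriting coordinate $i$ yields a uniformly random $x_{-i}$, so each trial samples a uniformly random edge of the hypercube.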


Theorem 8.2 (Goldreich et al. 2000) For $t = \Theta(n/\epsilon)$, the edge tester rejects every
function that is $\epsilon$-far from monotone with probability at least 2/3.


Proof: A simple calculation shows that it is enough to prove that a single random trial of
the edge test rejects a function that is $\epsilon$-far from monotone with probability at least $\epsilon/n$.
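One way to carry out the "simple calculation" is the following, assuming that each of the $t$ independent trials rejects with probability at least $\epsilon/n$ (the per-trial bound derived at the end of this proof). The explicit constant $\ln 3$ is a choice made here for illustration.

```latex
\[
\Pr[\text{all } t \text{ trials accept}]
\;\le\; \left(1 - \frac{\epsilon}{n}\right)^{t}
\;\le\; e^{-\epsilon t / n}
\;\le\; \frac{1}{3}
\qquad \text{whenever } t \ge \frac{n \ln 3}{\epsilon} .
\]
```

The middle inequality uses $1 - z \le e^{-z}$, so a number of trials proportional to $n/\epsilon$ suffices for a rejection probability of at least 2/3.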

Fix an arbitrary function $f$. There are two quantities that we need to relate to each
other: the rejection probability of $f$, and the distance between $f$ and the set of monotone
functions. We do this by relating both quantities to the sizes of the following sets: for
$i = 1, 2, \ldots, n$, define

$$A_i = \{x_{-i} \in \{0,1\}^{n-1} : f(0, x_{-i}) > f(1, x_{-i})\}. \qquad (8.1)$$

That is, $A_i$ is the edges of the $i$th slice of the hypercube across which $f$ violates monotonicity.
By the definition of the edge tester, the probability that a single trial rejects $f$ is exactly

$$\frac{\sum_{i=1}^n |A_i|}{n 2^{n-1}}, \qquad (8.2)$$

that is, the number of violated edges divided by the total number of edges.

Next, we upper bound the distance between $f$ and the set of monotone functions,
implying that the only way in which the $|A_i|$’s (and hence the rejection probability) can be
small is if $f$ is close to a monotone function. To upper bound the distance, all we need to do
is exhibit a single monotone function close to $f$. Our plan is to transform $f$ into a monotone
function, coordinate by coordinate, tracking the number of changes that we make along the
way. The next claim controls what happens when we “monotonize” a single coordinate.


Key Claim: Let $i \in \{1, 2, \ldots, n\}$ be a coordinate. Obtain $f'$ from $f$ by, for each violated
edge $((0, x_{-i}), (1, x_{-i})) \in A_i$ of the $i$th slice, swapping the values of $f$ on its endpoints
(Figure 8.2). That is, set $f'(0, x_{-i}) = 0$ and $f'(1, x_{-i}) = 1$. (This operation is well defined
because the edges of $A_i$ are disjoint.) For every coordinate $j = 1, 2, \ldots, n$, $f'$ has no more
monotonicity violations in the $j$th slice than does $f$.


Proof of Key Claim: The claim is clearly true for $j = i$: by construction, the swapping
operation fixes all of the monotonicity violations in the $i$th slice, without introducing any
new violations in the $i$th slice. The interesting case is when $j \ne i$, since new monotonicity
violations can be introduced (cf. Figure 8.2). The claim asserts that the overall number of
violations cannot increase (cf. Figure 8.2).


[Figure 8.2 panels: (a) Fixing the first slice; (b) Fixing the second slice.]

Figure 8.2 Swapping values to eliminate the monotonicity violations in the $i$th slice.

We partition the edges of the $j$th slice into edge pairs as follows. We use $x^0_{-j}$ to denote
an assignment to the $n - 1$ coordinates other than $j$ in which the $i$th coordinate is 0, and
$x^1_{-j}$ the corresponding assignment in which the $i$th coordinate is flipped to 1. For a choice
of $x^0_{-j}$, we can consider the “square” formed by the vertices $(0, x^0_{-j})$, $(0, x^1_{-j})$, $(1, x^0_{-j})$, and
$(1, x^1_{-j})$; see Figure 8.3. The edges $((0, x^0_{-j}), (1, x^0_{-j}))$ and $((0, x^1_{-j}), (1, x^1_{-j}))$ belong to the
$j$th slice, and ranging over the $2^{n-2}$ choices for $x^0_{-j}$ (one binary choice per coordinate
other than $i$ and $j$) generates each such edge exactly once.


0 0 0 0 




Figure 8.3 The number of monotonicity violations on edges 63 and 64 is at least as large under / 
as under /'. 

Fix a choice of and label the edges of the corresponding square ei, 62 , 63 , 64 as in 
Figure [ 8 ^ A simple case analysis shows that the number of monotonicity violations on edges 
63 and 64 is at least as large under / as under f. If neither ei nor 62 was violated under /, 
then f' agrees with / on this square and the total number of monotonicity violations is 
obviously the same. If both 64 and 62 were violated under /, then values were swapped along 
both these edges; hence 63 (respectively, 64 ) is violated under f if and only if 64 (respectively, 
63 ) was violated under /. Next, suppose the endpoints of 64 had their values swapped, 
while the endpoints of 62 did not. This implies that /(0,x*(_j) = 1 and /(0,x(_j) = 0, and 
hence f'{0,x^j) = 0 and f(0,x^j) = 1. If the endpoints (l,x°j.) and (l,x^j.) of 62 have 
the values 0 and 1 (under both / and /'), then the number of monotonicity violations on 
63 and 64 drops from 1 to 0. The same is true if their values are 0 and 0. If their values 
are 1 and 1, then the monotonicity violation on edge 64 under / moves to one on edge 63 
under f, but the number of violations remains the same. The final set of cases, when the 
endpoints of 62 have their values swapped while the endpoints of 64 do not, is similar 

^Suppose we corrected only one endpoint of an edge to fix a monotonicity violation, rather than swapping
the endpoint values. Would the proof still go through?

Summing over all such squares (all choices of $x^0_{-j}$), we conclude that the number of
monotonicity violations in the $j$th slice cannot increase. ■
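The Key Claim can be sanity-checked by brute force on small hypercubes. The following is a minimal Python sketch, not part of the original notes. It assumes that a Boolean function on $\{0,1\}^n$ is stored as a sequence of length $2^n$ indexed by bitmask (bit $i$ of the index is coordinate $i$); all function names are hypothetical.

```python
from itertools import product

def violations(f, n, j):
    """Count edges of slice j (endpoints differ only in bit j) with f(0, x) > f(1, x)."""
    return sum(1 for x in range(1 << n)
               if not (x >> j) & 1 and f[x] > f[x | (1 << j)])

def swap_slice(f, n, i):
    """Return a copy of f with endpoint values swapped on every violated edge of slice i."""
    g = list(f)
    for x in range(1 << n):
        y = x | (1 << i)
        # Each edge of slice i is visited once, from its endpoint with bit i = 0.
        if not (x >> i) & 1 and g[x] > g[y]:
            g[x], g[y] = g[y], g[x]
    return g

def key_claim_holds(n):
    """Check, for every Boolean f on {0,1}^n and all i != j, that fixing
    slice i does not increase the number of violations in slice j."""
    for f in product((0, 1), repeat=1 << n):
        for i in range(n):
            g = swap_slice(f, n, i)
            for j in range(n):
                if j != i and violations(g, n, j) > violations(f, n, j):
                    return False
    return True
```

Calling `key_claim_holds(3)` enumerates all 256 Boolean functions on the 3-dimensional hypercube. An exhaustive check like this is only evidence for small $n$, not a substitute for the case analysis above.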


Now consider turning a function $f$ into a monotone function $g$ by doing a single pass
through the coordinates, fixing all monotonicity violations in a given coordinate via swaps
as in the Key Claim. This process terminates with a monotone function: immediately after
coordinate $i$ is treated, there are no monotonicity violations in the $i$th slice by construction;
and by the Key Claim, fixing future coordinates does not break this property. The Key
Claim also implies that, in the iteration where this procedure processes the $i$th coordinate,
the number of monotonicity violations that need fixing is at most the number $|A_i|$ of
monotonicity violations in this slice under the original function $f$. Since the procedure
makes two modifications to $f$ for each monotonicity violation that it fixes (the two endpoints
of an edge), we conclude that $f$ can be made monotone by changing at most $2\sum_{i=1}^n |A_i|$ of
its values. If $f$ is $\epsilon$-far from monotone, then $\sum_{i=1}^n |A_i| \ge \epsilon 2^{n-1}$. Plugging this into (8.2), we
find that a single trial of the edge tester rejects such an $f$ with probability at least

$$\frac{\epsilon 2^{n-1}}{n 2^{n-1}} = \frac{\epsilon}{n},$$

as claimed. ■
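The repair procedure and the resulting bound can be illustrated concretely. The sketch below is an illustration under assumptions, not the notes' own code: functions are stored as length-$2^n$ sequences indexed by bitmask, the edge tester is taken to sample a uniformly random hypercube edge, and all names are hypothetical.

```python
from itertools import product

def slice_violations(f, n, i):
    """|A_i|: number of edges in slice i with f(0, x_-i) > f(1, x_-i)."""
    return sum(1 for x in range(1 << n)
               if not (x >> i) & 1 and f[x] > f[x | (1 << i)])

def rejection_probability(f, n):
    """Chance that one edge-tester trial (uniform random edge) hits a violation."""
    total = sum(slice_violations(f, n, i) for i in range(n))
    return total / (n * 2 ** (n - 1))

def make_monotone(f, n):
    """Single pass over the coordinates, swapping values on violated edges."""
    g = list(f)
    for i in range(n):
        for x in range(1 << n):
            y = x | (1 << i)
            if not (x >> i) & 1 and g[x] > g[y]:
                g[x], g[y] = g[y], g[x]
    return g

def check_all(n):
    """For every Boolean f on {0,1}^n: the repaired function is monotone
    and differs from f in at most 2 * sum_i |A_i| positions."""
    for f in product((0, 1), repeat=1 << n):
        g = make_monotone(f, n)
        total = sum(slice_violations(f, n, i) for i in range(n))
        changed = sum(1 for a, b in zip(f, g) if a != b)
        assert all(slice_violations(g, n, i) == 0 for i in range(n))
        assert changed <= 2 * total
    return True
```

Since the number of changed positions upper-bounds the distance to monotonicity, `rejection_probability(f, n)` is at least `changed / (n * 2**n)` for every `f`, which is the $\epsilon/n$ bound in miniature.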


8.3.2 Recent Progress for the Boolean Case 


An obvious question is whether or not we can improve over the query upper bound in
Theorem 8.2. The analysis in Theorem 8.2 of the edge tester is tight up to a constant factor
(see Exercises), so an improvement would have to come from a different tester. There was no
progress on this problem for 15 years, but recently there has been a series of breakthroughs.
Chakrabarty and Seshadhri (2013) gave the first improved upper bound, of
$\tilde{O}(n^{7/8}/\epsilon^{3/2})$. A year later, Chen et al. (2014) gave an upper bound of $\tilde{O}(n^{5/6}/\epsilon^4)$. Just a
couple of months ago, Khot et al. (2015) gave a bound of $\tilde{O}(\sqrt{n}/\epsilon^2)$. All of these improved
upper bounds are for path testers. The idea is to sample a random monotone path from
the hypercube (checking for a violation on its endpoints), rather than a random edge. One
way to do this is: pick a random point $x \in \{0,1\}^n$; pick a random number $z$ between 0 and
the number of zeroes of $x$ (from some distribution); and obtain $y$ from $x$ by choosing at
random $z$ of $x$’s 0-coordinates and flipping them to 1. Given that a function that is $\epsilon$-far
from monotone must have lots of violated edges (by Theorem 8.2), it is plausible that path
testers, which aspire to check many edges at once, could be more effective than edge testers.
The issue is that just because a path contains one or more violated edges does not imply
that the path’s endpoints will reveal a monotonicity violation. Analyzing path testers seems
substantially more complicated than the edge tester (Chakrabarty and Seshadhri, 2013;
Chen et al., 2014; Khot et al., 2015). Note that path testers are non-adaptive and have
1-sided error.

^The notation $\tilde{O}(\cdot)$ suppresses logarithmic factors.
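The sampling step of a path tester can be sketched as follows. This is a hedged illustration only: the notes say the number $z$ of flipped coordinates comes "from some distribution" without specifying it, so the uniform choice below is an assumption made purely for concreteness, and the function names are hypothetical. The distributions used in the cited papers are more delicate.

```python
import random

def sample_path_query(n, rng):
    """Sample a comparable pair x <= y in {0,1}^n, as in the text's description."""
    x = [rng.randint(0, 1) for _ in range(n)]
    zeros = [k for k in range(n) if x[k] == 0]
    # ASSUMPTION: z is uniform on {0, ..., #zeros}; the notes leave this open.
    z = rng.randint(0, len(zeros))
    y = list(x)
    for k in rng.sample(zeros, z):
        y[k] = 1  # flip z of x's 0-coordinates to 1
    return tuple(x), tuple(y)

def path_tester_trial(f, n, rng):
    """One trial: accept iff the endpoints of the sampled path show no violation."""
    x, y = sample_path_query(n, rng)
    return f(x) <= f(y)
```

Because $x \le y$ coordinate-wise by construction, a monotone `f` is accepted in every trial, which is the 1-sided error property noted above. The two queries are chosen without looking at any answers, so the tester is non-adaptive.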

There have also been recent breakthroughs on the lower bound side. It has been known
for some time that all non-adaptive testers with 1-sided error require $\Omega(\sqrt{n})$ queries (Fischer
et al., 2002); see also the Exercises. For non-adaptive testers with two-sided error, Chen et al.
(2014) proved a lower bound of $\tilde{\Omega}(n^{1/5})$ and Chen et al. (2015) improved this to $\Omega(n^{(1/2)-c})$
for every constant $c > 0$. Because the gap in query complexity between adaptive and
non-adaptive testers can only be exponential (see Exercises), these lower bounds also imply that
adaptive testers (with two-sided error) require $\Omega(\log n)$ queries. The gap between $\tilde{O}(\sqrt{n})$
and $\Omega(\log n)$ for adaptive testers remains open; most researchers think that adaptivity
cannot help and that the upper bound is the correct answer.

An interesting open question is whether or not communication complexity is useful for 
proving interesting lower bounds for the monotonicity testing of Boolean functions. We’ll
see in Section 8.4 that it is useful for proving lower bounds in the case where the range is
relatively large. 


8.3.3 Larger Ranges 


In this section we study monotonicity testing with the usual domain $D = \{0,1\}^n$ but with a
range $R$ that is an arbitrary finite, totally ordered set. Some of our analysis for the Boolean
case continues to apply. For example, the edge tester continues to be a well-defined tester
with 1-sided error. Returning to the proof of Theorem 8.2, we can again define each $A_i$ as
the set of monotonicity violations (meaning $f(0, x_{-i}) > f(1, x_{-i})$) along edges in the
$i$th slice. The rejection probability again equals the quantity in (8.2).

We need to revisit the major step of the proof of Theorem 8.2, which for Boolean
functions gives an upper bound of $2\sum_{i=1}^n |A_i|$ on the distance from a function $f$ to the
set of monotone functions. One idea is to again do a single pass through the coordinates,
swapping the function values of the endpoints of the edges in the current slice that have
monotonicity violations. In contrast to the Boolean case, this idea does not always result in
a monotone function (see Exercises).

We can extend the argument to general finite ranges $R$ by doing multiple passes over
the coordinates. The simplest approach uses one pass over the coordinates, fixing all
monotonicity violations that involve a vertex $x$ with $f(x) = 0$; a second pass, fixing all
monotonicity violations that involve a vertex $x$ with $f(x) = 1$; and so on. Formalizing
this argument yields a bound of $2|R| \sum_{i=1}^n |A_i|$ on the distance between $f$ and the set of
monotone functions, which gives a query bound of $O(n|R|/\epsilon)$ (Goldreich et al., 2000).


^At the very least, some of the techniques we’ve learned in previous lectures are useful. The arguments
in Chen et al. (2014) and Chen et al. (2015) use an analog of Yao’s Lemma (Lemma 2.3) to switch from
randomized to distributional lower bounds. The hard part is then to come up with a distribution over
both monotone functions and functions $\epsilon$-far from monotone such that no deterministic tester can reliably
distinguish between the two cases using few queries to the function.

A divide-and-conquer approach gives a better upper bound (Dodis et al., 1999). Assume
without loss of generality (relabeling if necessary) that $R = \{0, 1, \ldots, r-1\}$, and also (by
padding) that $r = 2^k$ for a positive integer $k$. The first pass over the coordinates fixes
all monotonicity violations that involve values that differ in their most significant bit:
one value that is less than $r/2$ and one value that is at least $r/2$. The second pass fixes all
monotonicity violations involving two values that differ in their second-highest-order bit.
And so on. The Exercises ask you to prove that this idea can be made precise and show that
the distance between $f$ and the set of monotone functions is at most $2 \log_2 |R| \sum_{i=1}^n |A_i|$.

This implies an upper bound of $O(\frac{n}{\epsilon} \log |R|)$ on the number of queries used by the edge
tester for the case of general finite ranges. The next section shows a lower bound of $\Omega(n/\epsilon)$
when $|R| = \Omega(\sqrt{n})$; in these cases, this upper bound is the best possible, up to the $\log |R|$
factor.
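The step from the distance bound to the query bound mirrors the Boolean case and can be written out as follows. This worked derivation is a sketch that assumes, consistent with the proof of Theorem 8.2, that (8.2) expresses the single-trial rejection probability as $\sum_{i=1}^n |A_i| / (n 2^{n-1})$.

```latex
% Assumption: (8.2) reads  Pr[reject] = (\sum_i |A_i|) / (n 2^{n-1}).
\begin{align*}
\epsilon 2^n
  &\le 2 \log_2 |R| \sum_{i=1}^n |A_i|
  && \text{(since $f$ is $\epsilon$-far from monotone)} \\
\Pr[\text{reject}]
  = \frac{\sum_{i=1}^n |A_i|}{n 2^{n-1}}
  &\ge \frac{\epsilon 2^{n-1} / \log_2 |R|}{n 2^{n-1}}
  = \frac{\epsilon}{n \log_2 |R|}.
\end{align*}
```

Repeating the trial $O(\frac{n}{\epsilon} \log |R|)$ times then rejects an $\epsilon$-far function with constant probability, which is where the stated query bound comes from.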


8.4 Monotonicity Testing: Lower Bounds 

8.4.1 Lower Bound for General Ranges 

This section uses communication complexity to prove a lower bound on the query complexity 
of testing monotonicity for sufficiently large ranges. 


Theorem 8.3 (Blais et al. (2012)) For large enough ranges $R$ and $\epsilon = \frac{1}{8}$, every (adaptive)
monotonicity tester with two-sided error uses $\Omega(n)$ queries.

Note that Theorem 8.3 separates the case of a general range $R$ from the case of a Boolean
range, where $\tilde{O}(\sqrt{n})$ queries are enough (Khot et al., 2015). With the right communication
complexity tools, Theorem 8.3 is not very hard to prove. Simultaneously with Blais et al.
(2012), Briët et al. (2012) gave a non-trivial proof from scratch of a similar
lower bound, but it applies only to non-adaptive testers with 1-sided error. Communication
complexity techniques naturally lead to lower bounds for adaptive testers with two-sided
error.

As always, the first thing to try is a reduction from Disjointness, with the query
complexity somehow translating to the communication cost. At first this might seem weird:
there’s only one “player” in property testing, so where do Alice and Bob come from?
But as we’ve seen over and over again, starting with our applications to streaming lower
bounds, it can be useful to invent two parties just for the sake of standing on the shoulders
of communication complexity lower bounds. To implement this, we need to show how a
low-query tester for monotonicity leads to a low-communication protocol for Disjointness.

It’s convenient to reduce from a “promise” version of Disjointness that is just as hard as
the general case. In the Unique-Disjointness problem, the goal is to distinguish between
inputs where Alice and Bob have sets $A$ and $B$ with $A \cap B = \emptyset$, and inputs where $|A \cap B| = 1$.
On inputs that satisfy neither property, any output is considered correct. The
Unique-Disjointness problem showed up a couple of times in previous lectures; let’s review them.
At the conclusion of our lecture on the extension complexity of polytopes, we proved that
the nondeterministic communication complexity of the problem is $\Omega(n)$ using a covering
argument with a clever inductive proof (Theorem 5.9). In our boot camp (Section 4.3.4), we
discussed the high-level approach of Razborov’s proof that every randomized protocol for
Disjointness with two-sided error requires $\Omega(n)$ communication. Since the hard probability
distribution in this proof makes use only of inputs with intersection size 0 or 1, the lower
bound applies also to the Unique-Disjointness problem.

^It is an open question to reduce the dependence on $|R|$. Since we can assume that $|R| \le 2^n$ (why?),
any sub-quadratic upper bound $o(n^2)$ would constitute an improvement.


Key to the proof of Theorem 8.3 is the following lemma. 


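Lemma 8.4 Fix sets $A, B \subseteq U = \{1, 2, \ldots, n\}$. Define the function $h_{AB} : 2^U \to \mathbb{Z}$ by

$$h_{AB}(S) = 2|S| + (-1)^{|S \cap A|} + (-1)^{|S \cap B|}. \qquad (8.3)$$

Then:

(i) If $A \cap B = \emptyset$, then $h_{AB}$ is monotone.

(ii) If $|A \cap B| = 1$, then $h_{AB}$ is $\frac{1}{8}$-far from monotone.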
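Both parts of the lemma can be checked exhaustively for small $n$. The following Python sketch is an illustration rather than part of the notes. It assumes that subsets of $U$ are encoded as bitmasks (elements renumbered $0, \ldots, n-1$), and the function names are hypothetical. For part (ii) it counts violated edges in the slice of the common element; these edges are vertex-disjoint, so each one forces at least one value to change.

```python
def popcount(x):
    """Number of elements in the set encoded by bitmask x."""
    return bin(x).count("1")

def h(S, A, B):
    """h_AB(S) = 2|S| + (-1)^{|S ∩ A|} + (-1)^{|S ∩ B|}, with sets as bitmasks."""
    return 2 * popcount(S) + (-1) ** popcount(S & A) + (-1) ** popcount(S & B)

def slice_violations(n, A, B, i):
    """Number of sets S not containing i with h_AB(S ∪ {i}) < h_AB(S)."""
    return sum(1 for S in range(1 << n)
               if not (S >> i) & 1 and h(S, A, B) > h(S | (1 << i), A, B))

def is_monotone(n, A, B):
    """True iff h_AB has no violated edge in any slice of the hypercube."""
    return all(slice_violations(n, A, B, i) == 0 for i in range(n))
```

For instance, with `n = 3`, `A = 0b011`, and `B = 0b101` (common element 0), the only violated edge in slice 0 is the one at `S = 0`, which matches the $2^n/8 = 1$ guaranteed by part (ii).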


We’ll prove the lemma shortly; let’s first see how to use it to prove Theorem 8.3. Let $Q$
be a tester that distinguishes between monotone functions from $\{0,1\}^n$ to $R$ and functions
that are $\frac{1}{8}$-far from monotone. We proceed to construct a (public-coin randomized) protocol
for the Unique-Disjointness problem.

Suppose Alice and Bob have sets $A, B \subseteq \{1, 2, \ldots, n\}$. The idea is for both parties to
run local copies of the tester $Q$ to test the function $h_{AB}$, communicating with each other as
needed to carry out these simulations. In more detail, Alice and Bob first use the public
coins to agree on a random string to be used with the tester $Q$. Given this shared random
string, $Q$ is deterministic. Alice and Bob then simulate local copies of $Q$ query-by-query:

1. Until $Q$ halts:

a) Let $S \subseteq \{1, 2, \ldots, n\}$ be the next query that $Q$ asks about the function $h_{AB}$.

b) Alice sends $(-1)^{|S \cap A|}$ to Bob.

c) Bob sends $(-1)^{|S \cap B|}$ to Alice.

d) Both Alice and Bob evaluate the function $h_{AB}$ at $S$, and give the result to their
respective local copies of $Q$.

2. Alice (or Bob) declares “disjoint” if $Q$ accepts the function $h_{AB}$, and “not disjoint”
otherwise.

^As usual, we’re not distinguishing between subsets of $\{1, 2, \ldots, n\}$ and their characteristic vectors.

We first observe that the protocol is well defined. Since Alice and Bob use the same random
string and simulate $Q$ in lockstep, both parties know the (same) relevant query $S$ to $h_{AB}$ in
every iteration, and thus are positioned to send the relevant bits ($(-1)^{|S \cap A|}$ and $(-1)^{|S \cap B|}$)
to each other. Given these bits, they are able to evaluate $h_{AB}$ at the point $S$ (even though
Alice doesn’t know $B$ and Bob doesn’t know $A$).

The communication cost of this protocol is twice the number of queries used by the
tester $Q$, and it doesn’t matter if $Q$ is adaptive or not. Correctness of the protocol follows
immediately from Lemma 8.4, with the error of the protocol the same as that of the tester $Q$.
Because every randomized protocol (with two-sided error) for Unique-Disjointness has
communication complexity $\Omega(n)$, we conclude that every (possibly adaptive) tester $Q$ with
two-sided error requires $\Omega(n)$ queries for monotonicity testing. This completes the proof of
Theorem 8.3.
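The simulation can be made concrete with a toy tester. The sketch below is an assumption-laden illustration: it plugs the edge tester in as a stand-in for $Q$ (any tester would do, and the theorem says none can use few queries), models the public coins with a shared seed, encodes sets as bitmasks, and counts one bit per sign sent. The function name `simulate_protocol` is hypothetical.

```python
import random

def popcount(x):
    return bin(x).count("1")

def simulate_protocol(n, A, B, num_trials, seed):
    """Alice (holding A) and Bob (holding B) jointly run an edge tester on h_AB.
    Returns the verdict and the total number of bits exchanged."""
    rng = random.Random(seed)  # public coins: both parties see the same randomness
    bits_sent = 0

    def query(S):
        nonlocal bits_sent
        alice_sign = (-1) ** popcount(S & A)  # computed by Alice, sent to Bob
        bob_sign = (-1) ** popcount(S & B)    # computed by Bob, sent to Alice
        bits_sent += 2                        # one bit in each direction
        return 2 * popcount(S) + alice_sign + bob_sign  # h_AB(S), known to both

    for _ in range(num_trials):
        i = rng.randrange(n)
        S = rng.getrandbits(n) & ~(1 << i)    # random set not containing i
        low, high = query(S), query(S | (1 << i))
        if low > high:                        # monotonicity violation found
            return "not disjoint", bits_sent
    return "disjoint", bits_sent
```

With disjoint inputs this stand-in never errs, because $h_{AB}$ is then monotone and the edge tester has 1-sided error. Each trial makes two queries and therefore costs four bits, in line with the factor of two per query in the analysis above.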


Proof of Lemma 8.4: For part (i), assume that $A \cap B = \emptyset$ and consider any set
$S \subseteq \{1, 2, \ldots, n\}$ and $i \notin S$. Because $A$ and $B$ are disjoint, $i$ does not belong to at least one of
$A$ or $B$. Recalling (8.3), in the expression $h_{AB}(S \cup \{i\}) - h_{AB}(S)$, the difference between
the first terms is 2, the difference in either the second terms (if $i \notin A$) or in the third
terms (if $i \notin B$) is zero, and the difference in the remaining terms is at least $-2$. Thus,
$h_{AB}(S \cup \{i\}) - h_{AB}(S) \ge 0$ for all $S$ and $i \notin S$, and $h_{AB}$ is monotone.

For part (ii), let $A \cap B = \{i\}$. For all $S \subseteq \{1, 2, \ldots, n\} \setminus \{i\}$ such that $|S \cap A|$ and
$|S \cap B|$ are both even, $h_{AB}(S \cup \{i\}) - h_{AB}(S) = -2$. If we choose $S \subseteq \{1, 2, \ldots, n\} \setminus \{i\}$ uniformly
at random, then $\Pr[|S \cap A| \text{ is even}]$ is 1 (if $A = \{i\}$) or $\frac{1}{2}$ (if $A$ has additional elements,
using the Principle of Deferred Decisions). Similarly, $\Pr[|S \cap B| \text{ is even}] \ge \frac{1}{2}$. Since no
potential element of $S \subseteq \{1, 2, \ldots, n\} \setminus \{i\}$ is a member of both $A$ and $B$, these two
events are independent and hence $\Pr[|S \cap A|, |S \cap B| \text{ are both even}] \ge \frac{1}{4}$. Thus, for at least
$\frac{1}{4} \cdot 2^{n-1} = 2^n/8$ choices of $S$, $h_{AB}(S \cup \{i\}) < h_{AB}(S)$. Since all of these monotonicity
violations involve different values of $h_{AB}$ (in the language of the proof of Theorem 8.2,
they are all edges of the $i$th slice of the hypercube), fixing all of them requires changing
$h_{AB}$ at $2^n/8$ values. We conclude that $h_{AB}$ is $\frac{1}{8}$-far from a monotone function. ■
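For reference, the computation behind part (ii) can be written term by term. This is a worked restatement of the argument above using the three terms of (8.3), not new material.

```latex
% Setting: A \cap B = \{i\}, i \notin S, and |S \cap A|, |S \cap B| both even.
\begin{align*}
h_{AB}(S \cup \{i\}) - h_{AB}(S)
  &= \underbrace{2(|S|+1) - 2|S|}_{=\,2}
   + \underbrace{(-1)^{|S \cap A|+1} - (-1)^{|S \cap A|}}_{=\,-2}
   + \underbrace{(-1)^{|S \cap B|+1} - (-1)^{|S \cap B|}}_{=\,-2}
   = -2, \\
\#\{\text{violated edges in slice } i\}
  &\ge \tfrac{1}{4} \cdot 2^{n-1} = \frac{2^n}{8}.
\end{align*}
```

The exponents increase by one in both sign terms because $i$ lies in both $A$ and $B$. In part (i), at most one of the two sign terms can change, which is why the difference never drops below zero there.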


8.4.2 Extension to Smaller Ranges 


Recalling the definition (8.3) of the function $h_{AB}$, we see that the proof of Theorem 8.3
establishes a query complexity lower bound of $\Omega(n)$ provided the range $R$ has size $\Omega(n)$. It
is not difficult to extend the lower bound to ranges of size $\Omega(\sqrt{n})$. The trick is to consider a
“truncated” version of $h_{AB}$, call it $h'_{AB}$, where values of $h_{AB}$ less than $n - c\sqrt{n}$ are rounded
up to $n - c\sqrt{n}$ and values more than $n + c\sqrt{n}$ are rounded down to $n + c\sqrt{n}$. (Here $c$ is a
sufficiently large constant.) The range of $h'_{AB}$ has size $\Theta(\sqrt{n})$ for all $A, B \subseteq \{1, 2, \ldots, n\}$.

We claim that Lemma 8.4 still holds for $h'_{AB}$, with the “$\frac{1}{8}$” in case (ii) replaced by
“$\frac{1}{16}$”; the new version of Theorem 8.3 then follows. Checking that case (i) in Lemma 8.4
still holds is easy: truncating a monotone function yields another monotone function. For
case (ii), it is enough to show that $h_{AB}$ and $h'_{AB}$ differ in at most a $\frac{1}{16}$ fraction of their
entries; since Hamming distance satisfies the triangle inequality, this implies that $h'_{AB}$ must
be $\frac{1}{16}$-far from the set of monotone functions. Finally, consider choosing $S \subseteq \{1, 2, \ldots, n\}$
uniformly at random: up to an ignorable additive term in $\{-2, -1, 0, 1, 2\}$, the value of $h_{AB}$
lies in $n \pm c\sqrt{n}$ with probability at least $\frac{15}{16}$, provided $c$ is a sufficiently large constant (by
Chebyshev’s inequality). This implies that $h_{AB}$ and $h'_{AB}$ agree on all but a $\frac{1}{16}$ fraction of
the domain, completing the proof.
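The Chebyshev step can be checked numerically. The sketch below rests on stated assumptions: it uses the fact that $h_{AB}(S)$ is within 2 of $2|S|$, and that $2|S|$ has mean $n$ and variance $n$ when $S$ is uniform, so the truncation can only change $h_{AB}(S)$ when $|2|S| - n| > c\sqrt{n} - 2$. The function names and the particular values of $n$ and $c$ used in the note are illustrative choices, not taken from the notes.

```python
from math import comb, sqrt

def truncate(value, n, c):
    """Round value into the interval [n - c*sqrt(n), n + c*sqrt(n)]."""
    lo, hi = n - c * sqrt(n), n + c * sqrt(n)
    return min(max(value, lo), hi)

def disagreement_bound(n, c):
    """Exact fraction of sets S with |2|S| - n| > c*sqrt(n) - 2.
    This upper-bounds the fraction of S on which h'_AB differs from h_AB,
    for every choice of A and B."""
    t = c * sqrt(n) - 2
    bad = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) > t)
    return bad / 2 ** n

def chebyshev_bound(n, c):
    """Chebyshev's estimate n / t^2 for the same tail, with t = c*sqrt(n) - 2."""
    t = c * sqrt(n) - 2
    return n / t ** 2
```

With `n = 2500` and `c = 5`, the Chebyshev estimate is already below $\frac{1}{16}$, and the exact tail is far smaller still. Any sufficiently large constant `c` works, as the text says.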

For even smaller ranges $R$, the argument above can be augmented by a padding argument
to prove a query complexity lower bound of $\Omega(|R|^2)$; see the Exercises.


8.5 A General Approach 


It should be clear from the proof of Theorem 8.3 that its method of deriving property testing
lower bounds from communication complexity lower bounds is general, and not particular
to the problem of testing monotonicity. The general template for deriving lower bounds for
testing a property $\mathcal{P}$ is:

1. Map inputs $(x, y)$ of a communication problem $\Pi$ with communication complexity at
least $c$ to a function $h_{(x,y)}$ such that:

a) 1-inputs $(x, y)$ of $\Pi$ map to functions $h_{(x,y)}$ that belong to $\mathcal{P}$;

b) 0-inputs $(x, y)$ of $\Pi$ map to functions $h_{(x,y)}$ that are $\epsilon$-far from $\mathcal{P}$.

2. Devise a communication protocol for evaluating $h_{(x,y)}$ that has cost $d$. (In the proof
of Theorem 8.3, $d = 2$.)

Via the simulation argument in the proof of Theorem 8.3, instantiating this template yields
a query complexity lower bound of $c/d$ for testing the property $\mathcal{P}$.
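The $c/d$ bound is a one-line accounting argument, spelled out here for completeness. The symbol $q$ for the number of queries made by the tester is introduced only for this illustration.

```latex
% q = number of queries made by the tester (symbol introduced for illustration).
\[
  c \;\le\; \text{(communication used by the simulation)} \;\le\; q \cdot d
  \quad\Longrightarrow\quad
  q \;\ge\; \frac{c}{d}.
\]
% In Theorem 8.3: c = \Omega(n) for Unique-Disjointness and d = 2, so q = \Omega(n).
```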

There are a number of other applications of this template to various property testing
problems, such as testing if a function admits a small representation (as a sparse polynomial,
as a small decision tree, etc.). See Blais et al. (2012) and Goldreich (2013) for several examples.
A large chunk of the property testing literature is about testing graph properties (Goldreich
et al., 1998). An interesting open question is if communication complexity can be used to
prove strong lower bounds for such problems.


^There is an analogous argument that uses one-way communication complexity lower bounds to derive
query complexity lower bounds for non-adaptive testers; see the Exercises.